Over the past several months we have broadened the scope of our investigation
into  the  lexical  and  phonetic  constraints  imposed  by  the  English  language.  At  the 
segmental  level,  we  investigated  the  effects  of  segmentation  and  classification 
uncertainties,  both  of  which  are  likely  to  occur  in  the  actual  implementation  of  a 
phonetic  front-end.  The  results  of  our  experiments  indicate  that,  if  the  uncertainties 
and  errors  are  reasonable,  the  constraints  can  still  be  very  powerful.  We  have  also 
conducted  a  number  of  experiments  examining  the  functional  loads  carried  by 
segments  in  stressed  versus  unstressed  syllables.  We  found  that  the  stressed 
syllables  provide  a  significantly  greater  amount  of  constraining  power  than  unstressed 
syllables.  This  implies  that  in  the  actual  implementation,  acoustic-phonetic 
information around unstressed syllables should not receive undue emphasis. At the
prosodic  level,  we  started  to  investigate  the  constraints  imposed  by  the  stress  pattern 
of  words.  Preliminary  results  indicate  that  knowledge  about  the  stress  pattern,  or 
simply  the  relative  position  of  the  syllable  with  primary  stress,  greatly  constrains  the 
number  of  word  candidates. 


Implementation  of  the  large-vocabulary,  isolated-word  recognition  system  has 
progressed  in  several  directions.  First,  the  performance  of  the  acoustic  classifier  was 
evaluated  on  the  speech  data  from  two  speakers,  one  male  and  one  female.  After 
minor  modifications,  we  feel  that  its  performance,  including  edge  detection  and 
parameter  characterization,  is  quite  satisfactory,  although  some  of  the  rules  are  still 
not  adequate.  Second,  a  software  system,  called  TRANSCRIBE,  has  been  written. 
This is an interactive system that allows researchers to write acoustic-phonetic rules
and  evaluate  their  performance  on  a  database.  The  system  has  the  capability  of 
explaining  the  history  of  how  a  rule  has  been  triggered,  as  well  as  why  certain  rules 
failed  to  apply.  This  facility,  together  with  the  speech  data  that  we  have  collected 
previously,  has  greatly  improved  our  ability  to  specify  and  debug  acoustic-phonetic 
rules.  As  a  consequence,  we  expect  that  we  will  be  able  to  complete  the  broad 
acoustic-phonetic classifier within the next several months. Third, we have
implemented  a  system  that  locates  the  stressed  and  reduced  syllables  of  a  word. 
Thus, for example, the system can determine that the first syllable of the word
"institute" is stressed, whereas the second syllable is reduced.

The  continuous  digit  recognition  system  has  been  implemented  up  to  the  level  of 
lexical  access,  and  we  have  just  completed  our  first  round  of  evaluation.  The  system 
was  developed  using  the  speech  data  from  one  male  speaker,  and  the  initial 
evaluation was performed using some two hundred digits spoken by three new
speakers, one male and two female. We found that the error rate is approximately 1%.
That is, the correct digit is not one of the candidates in less than 1% of the time. The
corresponding  depth  of  the  digit  lattice  is  approximately  3.  While  these  results  are 
very  preliminary,  we  are  nevertheless  encouraged  and  feel  that  this  may  be  a  viable 
approach to speaker-independent digit recognition. We are continuously refining the
system,  and  we  expect  to  make  another  performance  evaluation  in  the  near  future, 
this  time  over  a  larger  database.  In  addition,  we  have  also  started  to  collect  data  on 
other languages (Japanese, French, and Italian) for the digit task. Digit strings for
these languages have been recorded and digitized. Spectrograms were made, and
the acoustic-phonetic rules appropriate for the language in question were specified
both  within  a  digit  and  across  digit  boundaries.  We  expect  to  evaluate  the 
performance  of  the  basic  digit  recognition  system  for  different  languages,  thus 
determining  the  effectiveness  of  the  rules,  in  the  next  quarter. 

We  are  continuing  our  effort  to  find  cues  to  delineate  words  in  continuous  speech. 
It  is  well  known  that  words  in  continuous  speech  are  not  separated  by  pauses.  In 
some  cases,  the  acoustic  characteristics  can  be  significantly  different,  depending  on 
the  location  of  the  word  boundary.  Thus,  for  example,  the  acoustic  properties  of 
the phrases "nitrate" and "night rate" may be quite different. Before investigating the
possible  acoustic  differences  between  such  phrases,  we  first  investigated  the 
distributional  constraints  imposed  by  the  English  language.  We  asked  the  following 
question:  Given  a  consonant  sequence,  can  one  determine  whether  this  sequence 
can only occur at word boundaries? Using text files ranging from 200 to 38,000 words,
we found that, on the average, 80% of the consonant sequences found can only occur at
word boundaries. In other words, only one out of five consonant sequences can
occur word-internally as well as across word boundaries. Thus, given an ideal
phonetic  transcription,  the  word  boundary  can  be  determined  uniquely  most  of  the 
time.  Further  studies  along  this  line  will  continue  in  the  next  quarter. 
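
The distributional question above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure used in the study: words are assumed to be lists of phoneme symbols, the vowel inventory `VOWELS` is a toy one, and the function names are hypothetical. A consonant sequence counts as "word-internal" when it occurs between two vowels of a single word.

```python
VOWELS = {"a", "e", "i", "o", "u"}  # toy vowel inventory (assumption)

def medial_clusters(word):
    """Consonant runs that occur between two vowels inside one word."""
    clusters, run, seen_vowel = set(), [], False
    for ph in word:
        if ph in VOWELS:
            if seen_vowel and run:
                clusters.add(tuple(run))
            run, seen_vowel = [], True
        else:
            run.append(ph)
    return clusters

def boundary_cluster(left, right):
    """Final consonants of `left` followed by initial consonants of `right`."""
    tail = []
    for ph in reversed(left):
        if ph in VOWELS:
            break
        tail.append(ph)
    head = []
    for ph in right:
        if ph in VOWELS:
            break
        head.append(ph)
    return tuple(reversed(tail)) + tuple(head)

def boundary_only_fraction(words):
    """Fraction of cross-boundary consonant sequences never seen word-internally."""
    internal = set()
    for w in words:
        internal |= medial_clusters(w)
    spanning = {boundary_cluster(a, b) for a, b in zip(words, words[1:])}
    spanning.discard(())  # vowel-to-vowel junctions carry no consonants
    if not spanning:
        return 0.0
    only = {c for c in spanning if c not in internal}
    return len(only) / len(spanning)
```

A sequence that appears only in the cross-boundary set signals a word boundary whenever it is observed, which is the sense in which the boundary "can be determined uniquely."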

The  alignment  of  a  speech  signal  with  its  corresponding  phonetic  transcription  is 
an  essential  process  in  speech  research,  since  the  time-aligned  transcription  provides 
direct  access  to  specific  phonetic  events  in  the  signal.  Traditionally,  the  alignment  is 
done  manually  by  a  trained  acoustic  phonetician.  The  task,  however,  is  prone  to 
error, tedious, and extremely time-consuming. During the past six months we initiated
an  effort  to  develop  a  system  that  performs  the  time-alignment  automatically.  The 
alignment  is  achieved  using  a  standard  pattern  classification  algorithm  and  a  dynamic 
programming algorithm, augmented with acoustic-phonetic constraints. An initial
implementation of the system has been completed. We will refine some of its
components  and  perform  formal  evaluation  during  the  next  quarter. 
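
The dynamic programming step of such an alignment can be sketched as follows. This is a minimal illustration, not the system described above: the matrix `cost` is assumed to come from some frame-level classifier (not shown), each phone of the transcription must receive at least one frame, and the acoustic-phonetic constraints mentioned in the text are omitted.

```python
def align(cost):
    """Align frames to a known phone sequence by dynamic programming.

    cost[t][k] is the mismatch between frame t and the k-th phone of the
    transcription (assumed to be supplied by a classifier).
    Returns, for each frame, the index of the phone it is assigned to.
    Assumes there are at least as many frames as phones.
    """
    n_frames, n_phones = len(cost), len(cost[0])
    inf = float("inf")
    best = [[inf] * n_phones for _ in range(n_frames)]
    back = [[0] * n_phones for _ in range(n_frames)]
    best[0][0] = cost[0][0]  # the first frame must belong to the first phone
    for t in range(1, n_frames):
        for k in range(n_phones):
            stay = best[t - 1][k]                           # remain in phone k
            advance = best[t - 1][k - 1] if k > 0 else inf  # enter phone k
            if advance < stay:
                best[t][k], back[t][k] = advance + cost[t][k], k - 1
            else:
                best[t][k], back[t][k] = stay + cost[t][k], k
    # Trace back from the last phone at the last frame.
    path = [n_phones - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

The phone boundaries are the frames at which the returned index changes.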

Extensive support software for SPIRE, SPIREX, and LEXIS has been written. These
three programs have now been field tested in a number of research laboratories
around the country, including several defense contractors and installations. With the
arrival  of  the  Symbolics-3600  Lisp  machine,  programs  have  been  converted  such  that 
they  now  run  on  all  the  Lisp  machines.  We  continue  to  increase  the  amount  of  speech 
data.  We  now  have  more  than  one  hour  of  digitized  speech  available  on  line. 
1. Introduction

Automatic speech recognition by machine is a research topic that has
fascinated many speech scientists for more than forty years. For many, it
represents the ultimate challenge to our understanding of the production
and perception processes of human communication. However, the last decade
has witnessed a flourish of research efforts in the development of
speaker-dependent, small-vocabulary, isolated-word recognition systems
that utilize little or no speech-specific knowledge. These systems derive
their power primarily from general-purpose pattern recognition techniques.
While these techniques are adequate for a small class of well-constrained
speech recognition problems, their extendibility to tasks involving
multiple speakers, larger vocabularies and/or continuous speech is
questionable.

Reliance on general pattern matching techniques has been partly motivated
by the unsatisfactory performance of early phonetically-based speech
recognition systems. The difficulty of automatic acoustic-phonetic
analysis has also led to the speculation that phonetic information must be
derived primarily from semantic, syntactic and discourse constraints
rather than from the acoustic signal. The poor performance of the early
phonetically-based systems can be attributed mainly to our limited
knowledge of the context-dependency of the acoustic characteristics of
speech sounds. However, this picture is slowly changing. We now have a far
better understanding of contextual influences on phonetic segments. This
improved understanding has been demonstrated in a series of spectrogram
reading experiments (Zue and Cole, 1979; Cole et al., 1980; Cole and Zue,
1980). It was found that a trained subject can phonetically transcribe
unknown sentences from speech spectrograms with an accuracy of
approximately 85%. This performance is better than that of the phonetic
recognizers reported in the literature, both in accuracy and rank order
statistics. It was also demonstrated that the process of spectrogram
reading makes use of explicit acoustic-phonetic rules, and that this skill
can be learned by others. These results suggest that the acoustic signal
is rich in phonetic information, which should permit substantially better
performance in automatic phonetic recognition.

One of the most important factors contributing to the good performance of
spectrogram reading is our improved understanding of the acoustic
characteristics of fluent speech. To be sure, there has been ongoing
research on the acoustic properties of speech sounds over the past few
decades, and a great deal of knowledge has been acquired. However, with
few exceptions, these research efforts have been focused on the acoustic
properties of consonants and vowels in stressed consonant-vowel syllables.
It was not until the past decade that researchers began to focus on the
acoustic characteristics of speech sounds in continuous speech. We now
have a much better understanding of the properties of speech sounds in
different phonetic environments [see, for example, Umeda (1975), Kameny
(1975), Klatt (1975), Zue (1976), Umeda (1977)]. Furthermore, we are
beginning to develop a quantitative understanding of the phonological
processes governing the concatenation of words [see, for example, Oshika
et al. (1975), Cohen and Mercer (1975)]. A few of the effects have even
been studied in detail (Zue and Laferriere, 1979; Zue and
Shattuck-Hufnagel, 1980). In addition, as a consequence of studies on the
properties of speech sounds and of the auditory responses to speech-like
sounds [see, for example, Kiang (1980)], we are gaining better insight
into how the speech signal is processed in the auditory system, what
portions of the signal carry the principal information concerning
distinctive phonetic dimensions, and what portions show more variability
with respect to these dimensions. For example, the roles of the burst
spectra, burst amplitudes, and rapid onsets and offsets in identifying
place of articulation and other features for stop consonants have been
documented (Zue, 1976; Blumstein and Stevens, 1979).

In summary, we must emphasize that our ability to extract a great deal of
phonetic information from the acoustic signal is primarily a reflection of
our improved understanding of the factors that contribute to the phonetic
identities of speech sounds and their acoustic correlates. Spectrogram
reading is nothing more than a paradigm to demonstrate how the acoustic
cues for phonetic contrasts are encoded in the speech signal. Native
speakers of a language demonstrate this ability whenever they communicate
by voice. In the remainder of this paper, we will discuss how phonetic
information is encoded in the speech signal. We will also present some
alternative ways to represent such information in speech recognition
systems.


2. Phonetic variability in the speech signal

Phonotactic  constraints 

The speech signal is the output of a highly constrained system. In
addition to having a very limited inventory of possible phonemes, a given
language is also constrained with regard to the ways in which these
phonemes can combine to form meaningful words. Knowledge about such
constraints is implicitly possessed by native speakers of a language. For
example, a native English speaker knows that ‘vnuk’ is not an English
word. He/she also knows that if an English word starts with three
consonants, then the first consonant must be an /s/, and the second
consonant must be a voiceless stop (i.e. either /p/, /t/, or /k/). Such
phonotactic knowledge is presumably very useful in speech communication,
since it provides native speakers with the ability to fill in phonetic
details that are otherwise not available or are distorted. Thus, as an
extreme example, a word such as ‘splint’ can be recognized without having
to specify the detailed phonetic features of the phonemes. In fact,
‘splint’ is one of only two words in the Merriam Pocket Dictionary
(containing about 20000 words) that satisfy the following description:

[consonant][consonant][liquid or glide][vowel][nasal][stop].

While the existence of phonotactic constraints is well known, a recent set
of studies (Shipman and Zue, 1982; Huttenlocher and Zue, 1983) provides a
glimpse of the magnitude of their predictive power. These studies examine
the phonotactic constraints of American English from the phonemic
distributions in the 20000-word Merriam Webster’s Pocket Dictionary. In
one study the phonemes of each word were mapped into one of six broad
phonetic categories: vowels, stops, nasals, liquids and glides, strong
fricatives, and weak fricatives. Thus, for example, the word ‘speak’, with
a phonemic string given by /spik/, is represented as the pattern:

[strong fricative][stop][vowel][stop].

It was found that, even at this broad phonetic level, approximately 1/3 of
the words in a 20000-word lexicon can be uniquely specified. In general,
the size of the equivalence class (namely, the number of words sharing the
same pattern) was quite small. The average size of the equivalence classes
for the 20000-word lexicon was found to be approximately 2, and the
maximum size was approximately 200. In other words, in the worst case, a
broad phonetic representation of the words in a large lexicon reduces the
number of possible word candidates to about 1% of the lexicon.
Furthermore, over half of the lexical items belong to equivalence classes
of size 5 or less.
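
The equivalence-class analysis described above can be illustrated with a short sketch. The phoneme symbols, the `BROAD_CLASS` table, and the four-word toy lexicon are assumptions made for the example; the studies cited used the full 20000-word dictionary. The mean class size is computed here per class, which is one of several reasonable readings of "average size."

```python
from collections import defaultdict

# Toy mapping from (assumed) phoneme symbols to the six broad categories.
BROAD_CLASS = {
    "i": "vowel", "I": "vowel", "ae": "vowel",
    "p": "stop", "t": "stop", "k": "stop", "b": "stop",
    "m": "nasal", "n": "nasal",
    "l": "liquid or glide", "r": "liquid or glide",
    "s": "strong fricative", "sh": "strong fricative",
    "f": "weak fricative", "th": "weak fricative",
}

def broad_pattern(phonemes):
    """Map a phoneme string onto its sequence of broad phonetic categories."""
    return tuple(BROAD_CLASS[p] for p in phonemes)

def equivalence_classes(lexicon):
    """Group words that share the same broad phonetic pattern."""
    classes = defaultdict(list)
    for word, phonemes in lexicon.items():
        classes[broad_pattern(phonemes)].append(word)
    return classes

def summarize(classes):
    """Fraction of uniquely specified words, mean and maximum class size."""
    sizes = [len(words) for words in classes.values()]
    n_words = sum(sizes)
    return {
        "unique_fraction": sum(1 for s in sizes if s == 1) / n_words,
        "mean_class_size": n_words / len(sizes),
        "max_class_size": max(sizes),
    }

# Hypothetical four-word lexicon for illustration only.
TOY_LEXICON = {
    "speak": ["s", "p", "i", "k"],
    "steep": ["s", "t", "i", "p"],
    "split": ["s", "p", "l", "I", "t"],
    "fit": ["f", "I", "t"],
}
```

In the toy lexicon, ‘speak’ and ‘steep’ fall into the same class, while ‘split’ and ‘fit’ are uniquely specified by their patterns.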

Allophonic and phonological variations

V. W. Zue / Phonetic rules in A.S.R.

When speech sounds are connected to form larger linguistic units, the
canonical acoustic characteristics of a given speech sound will change as
a function of its immediate phonetic environment. As an illustrative
example, consider the utterance, “Tom Burton tried to steal a butter
plate,” shown in Fig. 1. Every word, except ‘a’, in this sentence contains
a single occurrence of the phoneme /t/. However, depending upon the
immediate phonetic environment and stress pattern, the underlying /t/’s
are realized alternatively as an aspirated /t/ (‘Tom’), an unaspirated /t/
(‘steal’), a retroflexed /t/ with extended aspiration (‘tried’), an
unreleased /t/ (‘plate’), a flap (‘butter’), or a glottal stop (‘Burton’).
The acoustic characteristics of these realizations are seen to be
drastically different.

The modification of the acoustic properties of speech sounds as a function
of the phonetic environment is not a phenomenon that is restricted to the
interior of a word. When words are concatenated to form phrases and
sentences, significant acoustic changes can result, as evidenced in the
following example. Fig. 2 shows a spectrogram of the seven words ‘did’,
‘you’, ‘meet’, ‘her’, ‘on’, ‘this’, and ‘ship’, spoken in isolation as
well as in a sentence, “Did you meet her on this ship?” We can see, for
example, that the word-final /d/ and the word-initial /y/ in the word pair
‘did you’ are realized acoustically as a single /ǰ/; the word-final /t/
and the word-initial /h/ in the word pair ‘meet her’ are realized as a
single flap; and the word-final /s/ and the word-initial /š/ in the word
pair ‘this ship’ are realized as a single, long /š/. Such phonetic changes
at word boundaries, particularly when there are adjacent word-final and
word-initial consonants, are extremely common in American English. In
order to properly perform lexical access, the nature of these phonological
rules must be understood.


3.  Representation  of  phonetic  knowledge 

Even though the acoustic realizations of phonetic segments are highly
context-sensitive, most of the variations, such as the ones illustrated in
Figs. 1 and 2, are systematic and can be captured by explicit rules. (For
example, /t/ becomes a glottal stop [?] when preceded by a stressed vowel
and followed by a syllabic nasal [n], as in ‘Burton’.) Over the past
decade, research in fluent speech has enabled us to gain a good
understanding of the nature of these rules and how they interact. Although
our present knowledge of the inventory of these rules is still incomplete,
such knowledge, however fragmented, must be incorporated into a speech
recognition system so that words can be recognized from seemingly
ambiguous acoustic signals.

In order to discuss how acoustic-phonetic knowledge should be represented
in speech recognition systems, it is perhaps useful to distinguish three
types of phonetic/phonological rules. First, there are the phonotactic
constraints governing the allowable combination of phonemes. For example,
the homorganic rule in English specifies that a syllable-final nasal/stop
cluster must agree in the place of articulation. It should be noted that
the obligatory nature of the phonotactic constraints makes them more
suited to categorical formulations.
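
Because such a constraint is categorical, it reduces to a simple yes/no test. The sketch below checks the homorganic rule as stated above on a syllable-final cluster; the phoneme symbols, the `PLACE` table, and the function name are assumptions made for illustration.

```python
# Place of articulation for a toy set of nasals and stops (assumed symbols).
PLACE = {
    "m": "labial", "p": "labial", "b": "labial",
    "n": "alveolar", "t": "alveolar", "d": "alveolar",
    "ng": "velar", "k": "velar", "g": "velar",
}
NASALS = {"m", "n", "ng"}
STOPS = {"p", "b", "t", "d", "k", "g"}

def violates_homorganic(coda):
    """True if a nasal is directly followed by a stop with a different place."""
    for first, second in zip(coda, coda[1:]):
        if first in NASALS and second in STOPS and PLACE[first] != PLACE[second]:
            return True
    return False
```

A recognizer could use a test of this kind to discard candidate segment strings that the language does not permit.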

Second, there are the rules that describe the modification of acoustic
characteristics of phonemes in various phonetic environments. These
allophonic rules are again mostly categorical. However, their acoustic
consequences may take on a continuum of values. For example, in American
English the phoneme /t/ becomes a retroflexed alveolar voiceless stop when
preceding a retroflexed consonant or vowel. On the other hand, the amount
of the acoustic change, such as the lowering of the burst frequency and
the lengthening of the voice onset time, may vary over a wide range.
Traditionally, allophonic variations have been considered one of the major
sources of difficulty for speech recognition, since they represent
undesirable distortion, or noise, imposed on the canonic characteristics
of the phonemes. However, researchers are beginning to see such allophonic
variations as a source of information [see, for example, Nakatani and
Dukes (1977)]. For example, Church (1983) demonstrated that detailed
knowledge about allophonic rules can be exploited during lexical access by
parsing the phonetic string into syllables and other suprasegmental
constituents.

Allophonic rules traditionally have been described in the
context-dependent formalism: A ⇒ B / C __ D (i.e., segment A becomes
segment B in the context of segments C and D). There are several drawbacks
with such a description. The rules assume that the output units are
discrete, whereas the variations seem to take on a continuous range of
values. In addition, the application of context-dependent rules often
requires the specification of the correct context, a process that can be
prone to error. (For example, the identification of a retroflexed /t/ in
the word ‘tree’ depends upon correctly identifying the following
retroflexed consonant /r/.) Finally, there is no definitive agreement
among researchers regarding the constituent units used to describe such
variations, be it segment, syllable, or metrical foot.
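
A context-dependent rule of this form can be sketched as a small rewrite function. The segment symbols, the use of "#" for a word edge, and the function name are assumptions for illustration; the sketch also makes the two drawbacks visible, since the output is a discrete symbol and the rule fires only if the context has been identified correctly.

```python
def apply_rule(segments, target, replacement, left=None, right=None):
    """Rewrite `target` as `replacement` when its neighbours match the context.

    `left` and `right` are sets of allowed neighbouring segments; None means
    the context on that side is unconstrained. "#" stands for a word edge.
    Contexts are tested against the input string, so all matches are
    rewritten simultaneously.
    """
    out = list(segments)
    for i, seg in enumerate(segments):
        if seg != target:
            continue
        prev = segments[i - 1] if i > 0 else "#"
        nxt = segments[i + 1] if i + 1 < len(segments) else "#"
        if (left is None or prev in left) and (right is None or nxt in right):
            out[i] = replacement
    return out
```

For instance, `apply_rule(["t", "r", "i"], "t", "t_retroflexed", right={"r"})` rewrites the /t/ of ‘tree’, but the same call leaves the /t/ untouched if the following /r/ has been mislabeled.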

The third type of phonetic rules specifies the optional realizations of a
particular phonetic string. Most of the low-level phonological rules that
describe alternate pronunciations of words fall into this category. These
rules specify, for example, that a word such as ‘international’ can have
many pronunciations including the deletion of /t/ and/or the deletion of
the penultimate schwa. Traditionally, this problem is solved by expanding
the lexicon, and the possible word combinations, with phonological rules
to include all possible pronunciations (Cohen and Mercer, 1975; Woods and
Zue, 1976). This approach also has some drawbacks. For example, dictionary
expansion does not capture the nature of phonetic variability, namely that
certain segments of a word are highly variable while others are relatively
invariant. It is also difficult to assign a likelihood measure to each of
the pronunciations. Finally, storing all alternate pronunciations is
computationally expensive, since the size of the lexicon can increase
substantially.
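
The cost of dictionary expansion can be seen in a minimal sketch, assuming that each optional rule simply marks one segment of the base form as deletable. The transcription of ‘international’ used in the usage note is a simplified, assumed one, and the function name is hypothetical.

```python
from itertools import product

def expand_pronunciations(segments, optional):
    """Enumerate every pronunciation obtained by deleting any subset of the
    segments whose indices appear in `optional`."""
    choices = [(seg, None) if i in optional else (seg,)
               for i, seg in enumerate(segments)]
    return {tuple(s for s in combo if s is not None)
            for combo in product(*choices)}
```

With the assumed base form `["I", "n", "t", "er", "n", "ae", "sh", "ax", "n", "ax", "l"]` and deletable segments at indices 2 (the /t/) and 7 (the penultimate schwa), the function yields four variants. In general n independent optional deletions yield up to 2^n entries per word, which is why the lexicon can grow substantially.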

It is interesting to note that, for American English at least, most of the
phonological rules tend to apply to unstressed syllables. Since the
acoustic cues for phonetic segments around unstressed syllables are
usually far less reliable than around stressed syllables [see, for
example, Cutler and Foss (1977)], one may ask whether detailed knowledge
of the various pronunciations is necessary for speech recognition. A
recent study conducted by Huttenlocher and Zue (1983) indicates that
phonetic segments within unstressed syllables provide little constraint
for lexical access. As in Shipman and Zue (1982), the phonemes were mapped
into broad phonetic classes, except this time the mapping was done only
for phonemes within stressed syllables. The entire unstressed syllable was
mapped into a ‘place holder’ symbol, [*]. Thus, for example, the word
‘spectrogram’ /spɛktrogræm/ is represented by the pattern:

[strong fricative][stop][vowel][stop][*][stop][liquid or glide][vowel][nasal].

It was found that such representation still provided powerful constraints
for lexical access. These results suggest that low-level phonological
variations may be handled by ‘wild-carding’ the unstressed syllables where
the functional load carried by the phonetic segments may be minimal.
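
The ‘wild-carding’ representation can be sketched as follows. The syllabification of ‘spectrogram’, the stress marks, the phoneme symbols, and the `BROAD_CLASS` table are all assumptions made for this example.

```python
# Toy mapping from (assumed) phoneme symbols to broad phonetic categories.
BROAD_CLASS = {
    "E": "vowel", "o": "vowel", "ae": "vowel",
    "p": "stop", "t": "stop", "k": "stop", "g": "stop",
    "m": "nasal",
    "r": "liquid or glide",
    "s": "strong fricative",
}

def stressed_pattern(syllables):
    """Broad-class pattern in which each unstressed syllable becomes '*'.

    `syllables` is a list of (phonemes, is_stressed) pairs.
    """
    pattern = []
    for phonemes, is_stressed in syllables:
        if is_stressed:
            pattern.extend(BROAD_CLASS[p] for p in phonemes)
        else:
            pattern.append("*")  # place holder for the whole syllable
    return tuple(pattern)

# Assumed syllabification and stress marking of 'spectrogram'.
SPECTROGRAM = [
    (["s", "p", "E", "k"], True),
    (["t", "r", "o"], False),
    (["g", "r", "ae", "m"], True),
]
```

Applied to `SPECTROGRAM`, the function reproduces the pattern given above, and any pronunciation variant that differs only inside the unstressed syllable maps onto the same pattern.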

4.  Summary 

In our view the speech signal is the output of a highly constrained
production mechanism. The decoding, or recognition, of sentences involves
the proper utilization of constraints at various levels, including
acoustic-phonetic, phonological, lexical, syntactic, and semantic. In this
paper, we discussed the types of constraints that exist at the phonetic
and phonological levels, and demonstrated that such constraints must be
captured in an automatic speech recognition system.

Over the past two decades, we have made significant improvements in our
qualitative understanding of the phonetic and phonological constraints.
However, there still remains a great deal of work that needs to be done.
First, we need to study a sufficient amount of data so that these
phenomena can be quantified. Second, a formal mechanism for describing
these constraints, both in terms of the proper units and the proper
grammar, must be devised. Finally, the interaction of these constraints at
different levels must be proposed, tested, and incorporated. From a
functional standpoint, there exist a variety of ways to handle the
phonetic variability, including the use of probabilistic modeling.
However, if we view the speech recognition problem as one of constructing
a computational model for speech perception, then the identification,
quantification, and formulation of these phonetic rules is a task that the
research community must collectively undertake. It is only with such a
long range research effort that we can, one day, hope to construct speech
recognition systems with capabilities approaching that of humans.
During the past decade, significant advances have been made in
the field of isolated word recognition (IWR). In many instances,
transitions from research results to practical implementations have
taken place. Today, speech recognition systems that can recognize a
small set of isolated words, say 50, for a given speaker with an error
rate of less than 5% appear to be relatively common. Most of the
current systems utilize little or no speech-specific knowledge, but
derive their power from general-purpose pattern recognition
techniques. The success of these systems can at least in part be
attributed to the introduction of novel parametric representations
(Makhoul, 1975), distance metrics (Itakura, 1975), and the very
powerful time alignment procedure of dynamic programming (Sakoe and
Chiba, 1971).


While we have clearly made significant advances in dealing with a
small portion of the speech recognition problem, there is serious
doubt regarding the extendibility of the pattern matching approach to
tasks involving multiple speakers, large vocabularies and/or
continuous speech. One of the limitations of the template matching
approach is that both computation and storage grow (essentially)
linearly with the size of the vocabulary. When the size of the
vocabulary is very large, e.g., over 10,000 words, the computation and
storage requirements associated with current IWR systems become
prohibitively expensive. Even if the computational cost were not an
issue, the performance of these IWR systems for a large vocabulary
would surely deteriorate (Keilin et al., 1981). Furthermore, as the
size of the vocabulary grows, it becomes imperative that such systems
be able to operate in a speaker-independent mode, since training of
the system for each user will take too long.

This  paper  proposes  a  new  approach  to  large-vocabulary,  isolated 
word  recognition  which  combines  detailed  acoustic-phonetic  knowledge 
with  constraints  on  the  sound  patterns  imposed  by  the  language.  The 
proposed  system  draws  on  the  results  of  two  sets  of  studies;  one 
demonstrating  the  richness  of  phonetic  information  in  the  acoustic 
signal  and  the  other  demonstrating  the  power  of  structural  constraints 
imposed  by  the  language. 

Spectrogram Reading. Reliance on general pattern matching
techniques has been partly motivated by the unsatisfactory performance
of early phonetically-based speech recognition systems. The
difficulty of automatic acoustic-phonetic analysis has also led to the
speculation that phonetic information must be derived, in large part,
from semantic, syntactic and discourse constraints rather than from
the acoustic signal. For the most part, the poor performance of these
phonetically-based systems can be attributed to the fact that our
knowledge of the context-dependency of the acoustic characteristics of
speech sounds was very limited at the time. However, this picture is
slowly changing. We now have a far better understanding of contextual
influences on phonetic segments. This improved understanding has been
demonstrated in a series of spectrogram reading experiments (Cole et
al., 1980). It was found that a trained subject can phonetically
transcribe unknown sentences from speech spectrograms with an accuracy
of approximately 85%. This performance is better than that of the
phonetic recognizers reported in the literature, both in accuracy and
rank order statistics. It was also demonstrated that the process of
spectrogram reading makes use of explicit acoustic-phonetic rules, and
that this skill can be learned by others. These results suggest that
the acoustic signal is rich in phonetic information, which should
permit substantially better performance in automatic phonetic
recognition.


However,  even  with  a  substantially  improved  knowledge  base,  a 
completely  bottom-up  phonetic  analysis  still  has  serious  drawbacks. 
It  is  often  difficult  to  make  fine  phonetic  distinctions . (for  example, 
distinguishing  the  word  pair  "Sue/shoe”)  reliably  across  a  wide  range 
of  speakers.  Furthermore,  the  application  of  context-dependent  rules 
often  requires  the  specification  of  the  correct  context,  a  process 
that  can  be  prone  to  error.  (For  example,  the  identification  of  a 
retroflexed  /t/  in  the  word  "tree"  depends  upon  correctly  identifying 
the  retroflex  consonant  /r/.)  Problems  such  as  these  suggest  that  a 
detailed  phonetic  transcription  of  an  unknown  utterance  may  not  by 
itself  be  a  desirable  aim  for  the  early  application  of  phonetic 
knowledge. 


Constraints  on  Sound  Patterns 

Detailed  segmental  representation 
of  the  speech  signal  constitutes  but  one  of  the  sources  of  encoded 
phonetic  information.  The  sound  patterns  of  a  given  language  are  not 
only  limited  by  the  inventory  of  basic  sound  units,  but  also  by  the 
allowable  combinations  of  these  sound  units.  Knowledge  about  such 
phonotactic  constraints  is  presumably  very  useful  in  speech 
communication,  since  it  provides  native  speakers  with  the  ability  to 
fill  in  phonetic  details  that  are  otherwise  not  available  or  are 
distorted.  Thus,  as  an  extreme  example,  a  word  such  as  "splint"  can 
be  recognized  without  having  to  specify  the  detailed  acoustic 
characteristics  of  the  phonemes  /s/,  /p/,  and  /n/.  In  fact,  "splint" 
is  the  only  word  in  the  Merriam  Pocket  Dictionary  (containing  about 
20,000  words)  that  satisfies  the  following  description: 

[CONS]  [CONS]  [l]  [VOWEL]  [NASAL]  [STOP]. 

In  a  study  of  the  properties  of  large  lexicons,  Shipman  and  Zue  (1982) 
found  that  knowledge  of  even  broad  specification  of  the  sound  patterns 
of  American  English  words,  both  at  the  segmental  and  suprasegmental 
levels,  imposes  strong  constraints  on  their  phonetic  identities.  For 
example,  if  each  word  in  the  lexicon  is  represented  only  in  terms  of  6 
broad  manner  categories  (such  as  vowel,  stop,  strong  fricative,  etc.), 
then  the  average  number  of  words  in  a  20,000-word  lexicon  that  share 
the  same  sound  pattern  is  about  2.  In  fact,  such  crude  classification 
will  enable  about  1/3  of  the  lexical  items  to  be  uniquely  determined. 

There  is  indirect  evidence  that  the  broad  phonetic 
characteristics  of  speech  sounds  and  their  structural  constraints  are 
utilized  to  aid  human  speech  perception.  For  example,  Blesser  (1969) 
has  shown  that  people  can  be  taught  to  perceive  spectrally-rotated 
speech,  in  which  manner  cues  and  suprasegmental  cues  are  preserved 
while  detailed  place  cues  are  severely  distorted.  The  data  on 
misperception  of  fluent  speech  reported  by  Bond  and  Garnes  (1980)  and 
the  results  of  experiments  on  listening  for  mispronunciation  reported 
by  Cole  and  Jakimik  (1980)  also  suggest  that  the  perceptual  mechanism 
utilizes  information  about  the  broad  phonetic  categories  of  speech 
sounds  and  the  constraints  on  how  they  can  be  combined. 


Proposed  System 


Based  on  the  results  of  the  two  studies  cited  above,  we  propose  a  new 
approach  to  phonetically-based  isolated-word 
recognition.  This  approach  is  distinctly  different  from  previous 
attempts  in  that  detailed  phonetic  analysis  of  the  acoustic  signal  is 
not  performed.  Rather,  the  speech  signal  is  segmented  and  classified 
into  several  broad  manner  categories.  The  broad  phonetic  (manner) 
classifier  serves  several  purposes.  First,  errors  in  phonetic 
labeling,  which  are  most  often  caused  by  detailed  phonetic  analyses, 
would  be  reduced.  Second,  by  avoiding  fine  phonetic  distinctions,  the 
system  should  also  be  less  sensitive  to  interspeaker  variations. 
Finally,  we  speculate  that  the  sequential  constraints  and  their 
distributions,  even  at  the  broad  phonetic  level,  may  provide  powerful 
mechanisms  to  reduce  the  search  space  substantially.  This  last 
feature  is  particularly  important  when  the  size  of  the  vocabulary  is 
large  (of  the  order  of  several  thousand  words  or  more). 


Once  the  acoustic  signal  has  been  reduced  to  a  string  (or 
lattice)  of  phonetic  segments  that  have  been  broadly  classified,  the 
resulting  representation  will  be  used  for  lexical  access.  The  intent 
is  to  reduce  the  number  of  possible  word  candidates  by  utilizing 
knowledge  about  the  structural  constraints,  both  segmental  and 
suprasegmental,  of  the  words.  The  result,  as  indicated  previously, 
should  be  a  relatively  small  set  of  word  candidates.  The  correct  word 
will  then  be  selected  through  judicious  applications  of  detailed 
phonetic  knowledge. 

In  summary,  this  paper  presents  a  new  approach  to  the  problem  of 
recognizing  isolated  words  from  large  vocabularies  and  multiple 
speakers.  The  system  initially  classifies  the  acoustic  signal  into 
several  broad  manner  categories.  Once  the  potential  word  candidates 
have  been  significantly  reduced  through  the  utilization  of  the 
structural  constraints,  then  a  detailed  examination  of  the  acoustic 
differences  would  follow.  Such  a  procedure  will  enable  us  to  deal 
with  the  large  vocabulary  recognition  problem  in  an  efficient  manner. 
What  is  even  more  important  is  the  fact  that  such  an  approach  bypasses 
the  often  tedious  and  error-prone  process  of  deriving  a  complete 
phonetic  transcription  from  the  acoustic  signal.  In  this  approach, 
detailed  acoustic  phonetic  knowledge  can  be  applied  in  a  top-down 
verification  mode,  where  the  exact  phonetic  context  can  be  specified. 

Exploring  Phonotactic  and  Lexical  Constraints  in  Word  Recognition.  Daniel  P. 
Huttenlocher  and  Victor  W.  Zue  (Room  36-549,  Department  of  Electrical 
Engineering  and  Computer  Science,  Massachusetts  Institute  of  Technology, 
Cambridge,  Massachusetts,  02139) 

In  a  previous  meeting  of  the  Society,  Zue  and  Shipman  demonstrated 
that  the  constraints  imposed  by  the  allowable  sound  sequences  of  a  language 
are  extremely  powerful.  Even  at  a  broad  phonetic  level  of  representation, 
sequential  constraints  severely  limit  the  number  of  possible  word  candidates 
[JASA  Vol.  71,  S7].  This  provides  an  attractive  model  for  lexical  access 
based  on  partial  phonetic  information.  However,  Zue  and  Shipman's  results  did 
not  take  into  account  the  fact  that  the  acoustic  realizations  of  phonetic 
segments  are  highly  variable,  and  this  variability  introduces  a  good  deal  of 
recognition  ambiguity  in  the  initial  classification  of  the  signal.  We  have 
conducted  a  set  of  studies  investigating  the  robustness  of  sequential  phonetic 
constraints  with  respect  to  variability  and  error  in  broad  phonetic 
classification.  In  these  studies  segment  misclassifications  or  deletions  are 
permitted.  In  one  study  it  was  found  that  the  phonetically  variable  parts  of 
words  (around  reduced  syllables)  provide  much  less  lexical  constraint  than  the 
phonetically  invariant  parts.  Thus,  by  utilizing  the  information  from  robust 
parts  of  a  word,  a  large  lexicon  can  still  be  partitioned  into  small 
equivalence  classes.  Detailed  results  of  the  studies  will  be  presented. 
[Work  supported  by  the  Office  of  Naval  Research  under  contract 

In  a  previous  meeting  of  the  Society,  Zue  and  Shipman  presented  some  results 
demonstrating  the  predictive  power  of  phonotactic  and  lexical  constraints  in  American 
English.  Using  the  20,000-word  Merriam-Webster's  Pocket  Dictionary  as  their  database, 
they  mapped  the  phonemes  of  each  word  into  one  of  six  broad  phonetic  categories:  vowels, 
stops,  nasals,  liquids  and  glides,  strong  fricatives,  and  weak  fricatives.  One  can  view  the 
broad  phonetic  classifications  as  partitioning  the  lexicon  into  equivalence  classes  of  words 
sharing  the  same  phonetic  class  pattern.  Thus,  for  example,  the  words  "speak",  "stop", 
"scout",  and  48  other  words  fall  into  the  same  equivalence  class:  (STRONG-FRICATIVE) 
(STOP)  (VOWEL)  (STOP).  Zue  and  Shipman  found  that  the  average  size  of  these 
equivalence  classes  is  approximately  2,  and  the  maximum  class  size  is  approximately  200. 
In  addition,  it  was  found  that,  even  at  this  broad  phonetic  level,  approximately  1/3  of  the 
words  in  the  20,000-word  lexicon  can  be  uniquely  specified.  One  conclusion  of  their  study 
is  that  lexical  and  phonotactic  constraints  are  extremely  powerful,  even  at  the  broad 
phonetic  level. 
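
The partitioning described above can be sketched in a few lines. This is only an illustration, not the procedure Zue and Shipman actually ran: the ARPAbet-style phoneme symbols, the (incomplete) phoneme-to-class table, and the five-word toy lexicon are all assumptions introduced here.

```python
from collections import defaultdict

# Hypothetical, incomplete phoneme-to-class table (ARPAbet-style symbols).
BROAD_CLASS = {
    "P": "STOP", "T": "STOP", "K": "STOP", "B": "STOP", "D": "STOP", "G": "STOP",
    "M": "NASAL", "N": "NASAL", "NG": "NASAL",
    "L": "LIQUID-GLIDE", "R": "LIQUID-GLIDE", "W": "LIQUID-GLIDE", "Y": "LIQUID-GLIDE",
    "S": "STRONG-FRICATIVE", "Z": "STRONG-FRICATIVE", "SH": "STRONG-FRICATIVE",
    "F": "WEAK-FRICATIVE", "V": "WEAK-FRICATIVE", "TH": "WEAK-FRICATIVE",
    "IY": "VOWEL", "IH": "VOWEL", "AA": "VOWEL", "AW": "VOWEL", "AH": "VOWEL",
}


def broad_pattern(phonemes):
    """Map a phoneme sequence to its sequence of broad phonetic classes."""
    return tuple(BROAD_CLASS[p] for p in phonemes)


def equivalence_classes(lexicon):
    """Group words that share the same broad phonetic class pattern."""
    classes = defaultdict(list)
    for word, phonemes in lexicon.items():
        classes[broad_pattern(phonemes)].append(word)
    return dict(classes)


# Toy lexicon; the transcriptions are illustrative assumptions.
TOY_LEXICON = {
    "speak": ["S", "P", "IY", "K"],
    "stop": ["S", "T", "AA", "P"],
    "scout": ["S", "K", "AW", "T"],
    "splint": ["S", "P", "L", "IH", "N", "T"],
    "piston": ["P", "IH", "S", "T", "AH", "N"],
}

classes = equivalence_classes(TOY_LEXICON)
for pattern, words in classes.items():
    print(" ".join(pattern), "->", words)
```

In this toy lexicon "speak", "stop", and "scout" share one class, while "splint" and "piston" are each uniquely specified by their patterns.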

Before  presenting  any  new  results,  we  would  like  to  add  two  footnotes  to  the  studies 
conducted  by  Zue  and  Shipman.  First,  the  average  equivalence  class  size  may  not  have 
been  an  appropriate  measure  of  lexical  and  phonotactic  constraint.  A  more  informative 
measure  may  be  the  expected  value  of  the  equivalence  class  size.  That  is,  given  a  word,  what  is 
the  size  of  the  equivalence  class  into  which  the  word  is  likely  to  fall?  We  computed  the 
expected  value  for  the  Zue  and  Shipman  results;  the  values  are  shown  on  the  first  overlay. 
The  expected  value,  while  greater  by  an  order  of  magnitude  than  the  average  value,  still 
represents  only  a  tenth  of  a  percent  of  the  entire  lexicon.  As  an  indication  of  the  spread  of 
the  distribution,  we  have  also  included  the  median  class  size. 
It  should  be  noted  that  these  results  are  for  a  lexicon  where  the  words  are  given  uniform 
weighting.  However,  the  frequency  distribution  of  English  words  is  highly  skewed.  When 
the  words  in  the  lexicon  are  weighted  in  terms  of  their  frequency  of  occurrence  based  on 
the  Brown  Corpus,  as  shown  on  the  second  overlay,  the  median  increases  a  good  deal, 
whereas  the  expected  value  only  increases  moderately.  For  the  remainder  of  this 
presentation,  the  words  in  the  lexicon  have  all  been  weighted  by  their  frequency  of 
occurrence  in  the  Brown  Corpus,  thus  providing  a  closer  approximation  to  the  usage  of 
words  in  a  language. 
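
The difference between the average and the expected class size can be made concrete with a small sketch. The function names, the six-word toy partition, and the toy frequency counts below are hypothetical. The sketch also assumes that, under frequency weighting, a class's size still counts distinct words and only the averaging over words is weighted; the text does not spell this out.

```python
from collections import Counter


def mean_class_size(word_to_class):
    """Average class size: number of words divided by number of classes."""
    sizes = Counter(word_to_class.values())
    return len(word_to_class) / len(sizes)


def expected_class_size(word_to_class, weights=None):
    """Expected size of the class into which a randomly drawn word falls.

    With weights=None every word is equally likely; otherwise words are
    drawn in proportion to their (relative) frequency of occurrence.
    """
    sizes = Counter(word_to_class.values())
    if weights is None:
        weights = {word: 1.0 for word in word_to_class}
    total = sum(weights[word] for word in word_to_class)
    weighted = sum(weights[w] * sizes[word_to_class[w]] for w in word_to_class)
    return weighted / total


# Toy partition: two singleton classes and one class of four words.
TOY = {"a": "X", "b": "Y", "c": "Z", "d": "Z", "e": "Z", "f": "Z"}
# Toy frequencies: the word "a" is much more common than the others.
TOY_FREQ = {"a": 6, "b": 1, "c": 1, "d": 1, "e": 1, "f": 1}

print(mean_class_size(TOY))                # 6 words / 3 classes
print(expected_class_size(TOY))            # (1 + 1 + 4*4) / 6
print(expected_class_size(TOY, TOY_FREQ))  # (6*1 + 1*1 + 4*4) / 11
```

The expected value exceeds the average whenever class sizes are unequal, because a randomly chosen word is more likely to come from a large class.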

Our  second  comment  on  Zue  and  Shipman's  results  is  that  they  explored 
segmental/phonetic  and  prosodic  constraints  independently  of  one  another.  Since  these 
sources  of  information  are  presumably  useful  in  combination,  we  augmented  the  broad 
phonetic  representation  with  two-level  stress  information  for  the  word.  Thus,  for  example, 
the  word  "piston"  is  represented  both  by  a  broad  phonetic  classification  of:  (STOP) 
(VOWEL)  (STRONG-FRICATIVE)  (STOP)  (VOWEL)  (NASAL),  and  a  prosodic 
representation  of:  (STRESSED)  (UNSTRESSED).  The  results  of  incorporating  stress 
information  are  shown  on  the  next  viewgraph.  As  can  be  seen,  the  median,  the  expected 
value  and  the  maximum  equivalence  class  size  all  decreased  as  a  result  of  introducing 
lexical  stress  information,  suggesting  the  usefulness  of  prosodic  information. 
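
A minimal sketch of how adding stress refines the partition is given below. The three hand-entered entries, including the added words "bacon" and "baboon", and the function name `partition` are assumptions made for illustration; they are not taken from the study.

```python
from collections import defaultdict

# word -> (broad phonetic pattern, two-level stress pattern); toy data.
TOY = {
    "bacon": (("STOP", "VOWEL", "STOP", "VOWEL", "NASAL"),
              ("STRESSED", "UNSTRESSED")),
    "baboon": (("STOP", "VOWEL", "STOP", "VOWEL", "NASAL"),
               ("UNSTRESSED", "STRESSED")),
    "piston": (("STOP", "VOWEL", "STRONG-FRICATIVE", "STOP", "VOWEL", "NASAL"),
               ("STRESSED", "UNSTRESSED")),
}


def partition(entries, use_stress):
    """Partition words by broad pattern, optionally refined by stress."""
    classes = defaultdict(list)
    for word, (broad, stress) in entries.items():
        key = (broad, stress) if use_stress else broad
        classes[key].append(word)
    return dict(classes)


print(len(partition(TOY, use_stress=False)))  # "bacon" and "baboon" collide
print(len(partition(TOY, use_stress=True)))   # stress pattern separates them
```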

The  above  results  demonstrate  that  broad  phonetic  classifications  of  words  can,  in 
principle,  reduce  the  number  of  word  candidates  significantly.  However,  the  acoustic 
realization  of  a  phone  can  be  highly  variable,  and  this  variability  introduces  a  good  deal  of 
recognition  ambiguity  in  the  initial  classification  of  the  speech  signal.  At  one  extreme,  the 
acoustic  characteristics  of  a  phoneme  can  undergo  simple  modifications  as  a  consequence 
of  contextual  and  inter-speaker  differences.  At  the  other  extreme,  contextual  effects  can 
also  produce  severe  modifications  in  which  phonemes  or  syllables  are  deleted  altogether. 
Thus,  for  example,  the  word  "international"  can  have  many  different  realizations, 
including  the  deletion  of  the  second  as  well  as  penultimate  syllables. 

In  order  to  evaluate  the  viability  of  a  broad  phonetic  class  representation  for  speech 
recognition  systems,  two  major  problems  must  first  be  considered.  The  first  problem  is  that 
of  mis-labeling  a  phonetic  segment,  and  the  second  problem  is  the  deletion  of  a  segment 
altogether.  It  is  important  to  note  that  these  phenomena  can  result  both  from  an  error 
by  the  speech  recognition  system  and  from  the  high  level  of  variability  in 
natural  speech.  That  is,  not  only  can  the  recognizer  make  a  mistake,  but  a  given  speaker  can 
utter  a  word  with  changed  or  deleted  segments.  Therefore,  even  a  perfect  recognizer  would 
still  have  "errors"  in  its  input.  The  primary  focus  of  this  talk  is  to  investigate  the  effects  on 
Zue  and  Shipman's  results  when  errors,  both  in  classification  and  segmentation,  are 
introduced. 

We  have  tried  to  infer  the  effect  of  mis-labeling  phonetic  segments,  by  allowing  for 
reasonable  confusions  among  the  six  phonetic  classes.  The  allowable  confusion  is 
determined  based  on  our  acoustic-phonetic  intuitions,  as  well  as  our  past  experience  with 
speech  recognition  front-ends.  Some  of  the  confusions  are  context-independent,  whereas 
others  are  permitted  only  under  certain  phonetic  environments.  In  one  study,  we  allowed 
strong  fricatives  to  be  confused  with  weak  fricatives  while  only  medial  nasals  can  be 
confused  with  liquids  and  glides,  and  only  word-initial  aspirated  stops  can  be  confused 
with  strong  fricatives.  Thus,  in  this  study  each  word  is  represented  by  one  or  more  broad 
phonetic  sequences,  depending  on  the  possible  confusions  in  the  word.  In  the  next 
viewgraph  we  see  that  introducing  these  confusions  did  not  change  the  previous  results 
significantly.  This  suggests  that,  if  the  classification  uncertainty  is  reasonable,  then  lexical 
constraints  imposed  by  sequences  of  broad  phonetic  classes  are  still  extremely  powerful. 
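
One way to generate the "one or more broad phonetic sequences" per word is sketched below. It is an assumption-laden illustration, not the procedure used in the study: the strong/weak fricative confusion is treated as symmetric, "medial" is taken to mean neither word-initial nor word-final, and aspiration of an initial stop is supplied by the caller through the hypothetical `initial_aspirated` flag.

```python
from itertools import product


def alternatives(label, index, length, initial_aspirated):
    """Return the labels a front-end might assign to one segment."""
    alts = [label]
    if label == "STRONG-FRICATIVE":
        alts.append("WEAK-FRICATIVE")
    elif label == "WEAK-FRICATIVE":
        alts.append("STRONG-FRICATIVE")
    elif label == "NASAL" and 0 < index < length - 1:
        # Only medial nasals may be confused with liquids and glides.
        alts.append("LIQUID-GLIDE")
    elif label == "STOP" and index == 0 and initial_aspirated:
        # Only word-initial aspirated stops may look like strong fricatives.
        alts.append("STRONG-FRICATIVE")
    return alts


def confusable_patterns(pattern, initial_aspirated=False):
    """Expand one broad pattern into every pattern the confusions allow."""
    options = [
        alternatives(label, i, len(pattern), initial_aspirated)
        for i, label in enumerate(pattern)
    ]
    return set(product(*options))


PISTON = ("STOP", "VOWEL", "STRONG-FRICATIVE", "STOP", "VOWEL", "NASAL")
for alt in sorted(confusable_patterns(PISTON, initial_aspirated=True)):
    print(" ".join(alt))
```

A word is then indexed under every pattern in its expanded set, so a lookup with a mislabeled input still retrieves it.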

We  now  turn  to  the  second  issue,  namely,  segmentation  uncertainty  due  to  front-end 
error  or  alternate  pronunciations.  The  broad  phonetic  representation  cannot  handle 
segment  or  syllable  deletions,  because  when  a  segment  is  deleted  the  broad  phonetic  class 
sequence  is  affected.  Traditionally,  this  problem  is  solved  by  expanding  the  lexicon  via 
phonological  rules,  in  order  to  include  all  possible  pronunciations  of  each  word.  We  find 
this  alternative  unattractive  for  several  reasons.  First  of  all,  dictionary  expansion  does  not 
capture  the  nature  of  phonetic  variability.  Once  a  given  word  is  represented  as  a  set  of 
alternate  pronunciations,  the  fact  that  certain  segments  of  a  word  are  highly  variable  while 
others  are  relatively  invariant  is  lost.  In  fact,  we  shall  see  that  the  less  variable  segments  of 
a  word  provide  more  lexical  constraint  than  those  segments  which  are  highly  variable. 
Another  problem  with  lexical  expansion  is  that  of  assigning  realistic  likelihood  measures  to 
each  pronunciation.  Finally,  storing  all  alternate  pronunciations  is  computationally 
expensive,  since  the  size  of  the  lexicon  can  increase  substantially. 

Some  segments  of  a  word  are  highly  variable,  while  others  are  more  or  less  invariant. 
Depending  on  the  extent  to  which  the  variable  segments  constrain  lexical  access,  it  might 
be  possible  to  represent  words  only  in  terms  of  their  less  variable  parts.  It  is  interesting  to 
note  that,  in  American  English,  most  of  the  low-level  phonological  rules  apply  to 
unstressed  syllables.  In  other  words,  phonetic  segments  around  unstressed  syllables  are 
more  variable  than  those  around  stressed  syllables.  Perceptual  results  have  also  shown  that 
the  acoustic  cues  for  phonetic  segments  around  unstressed  syllables  are  usually  far  less 
reliable  than  around  stressed  ones.  Thus,  one  may  ask  to  what  extent  the  phones  in 
unstressed  syllables  are  useful  for  lexical  access. 

In  an  attempt  to  answer  this  question,  we  compared  the  relative  lexical  constraint  of 
phones  in  stressed  versus  unstressed  syllables.  In  one  experiment,  we  classified  the  words 
in  the  20,000-word  Webster's  Pocket  Dictionary  either  according  to  only  the  phones  in 
stressed  syllables,  or  according  to  only  the  phones  in  unstressed  syllables.  Lexical 
representations  for  this  experiment  are  illustrated,  for  the  word  "piston",  in  the  next 
viewgraph.  In  the  first  condition,  shown  on  the  left,  the  phones  in  stressed  syllables  were 
mapped  into  their  corresponding  phonetic  classes  while  each  unstressed  syllable  was 
mapped  into  a  placeholder  symbol.  In  the  second  condition,  shown  on  the  right,  the 
opposite  was  done.  For  both  conditions,  stress  information  is  retained. 

The  results  of  this  experiment  are  given  in  the  next  viewgraph.  By  comparing  the 
outcomes  of  the  two  conditions,  it  can  be  seen  that  information  within  stressed  syllables 
provides  much  more  constraint  for  lexical  access  than  that  in  unstressed  syllables.  This  is 
particularly  interesting  in  light  of  the  fact  that  the  phones  in  stressed  syllables  are  much  less 
variable  than  those  in  unstressed  syllables.  Therefore,  recognition  systems  may  not  have  to 
be  terribly  concerned  with  correctly  identifying  the  phones  in  unstressed  syllables.  Not 
only  is  the  signal  highly  variable  in  these  segments,  making  classification  difficult;  the 
segments  do  not  constrain  recognition  as  much  as  the  less  variable  segments. 


This  representation  is  very  robust  with  respect  to  segmental  and  syllabic  deletions.  As 
pointed  out  previously,  most  segment  deletions  occur  in  unstressed  syllables.  Since  the 
phones  in  unstressed  syllables  are  not  included  in  the  representation,  their  deletion  or 
modification  is  ignored. 
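
The stressed-syllable encoding and its robustness to deletions can be sketched as follows. The syllabification of "piston", the "*" placeholder symbol, and the reduced pronunciation with a deleted vowel are assumptions made for illustration; the actual lexical representations appeared only on the viewgraph.

```python
PLACEHOLDER = "*"


def encode(syllables, keep):
    """Keep broad labels only for syllables whose stress equals `keep`.

    Every other syllable collapses to a single placeholder, so the stress
    pattern stays visible while the discarded phones are ignored.
    """
    encoded = []
    for stress, labels in syllables:
        if stress == keep:
            encoded.extend(labels)
        else:
            encoded.append(PLACEHOLDER)
    return tuple(encoded)


# "piston" as (stress, broad labels) per syllable; toy syllabification.
PISTON_FULL = [
    ("STRESSED", ["STOP", "VOWEL", "STRONG-FRICATIVE"]),
    ("UNSTRESSED", ["STOP", "VOWEL", "NASAL"]),
]
# Hypothetical reduced pronunciation: the unstressed vowel is deleted.
PISTON_REDUCED = [
    ("STRESSED", ["STOP", "VOWEL", "STRONG-FRICATIVE"]),
    ("UNSTRESSED", ["STOP", "NASAL"]),
]

print(encode(PISTON_FULL, keep="STRESSED"))    # first condition
print(encode(PISTON_FULL, keep="UNSTRESSED"))  # second condition
print(encode(PISTON_FULL, keep="STRESSED") == encode(PISTON_REDUCED, keep="STRESSED"))
```

Both pronunciations map to the same key under the first condition, which is the sense in which the representation ignores deletions in unstressed syllables.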

In  summary,  we  have  demonstrated  a  method  for  encoding  the  words  in  a  large  lexicon 
using  broad  phonetic  and  prosodic  information.  This  scheme  takes  advantage  of  the  fact 
that  even  at  a  broad  level  of  description,  the  sequential  constraints  on  allowable  sound 
sequences  are  very  strong.  It  also  makes  use  of  the  fact  that  the  phonetically  variable  parts 
of  words  provide  much  less  lexical  constraint  than  the  phonetically  invariant  parts.  The 
interesting  properties  of  the  representation  are  that  it  is  based  on  relatively  robust  phonetic 
classes,  it  allows  for  phonetic  variability,  and  it  partitions  the  lexicon  into  very  small 
equivalence  classes.  This  makes  the  representation  attractive  as  a  search  avoidance 
technique  for  large-vocabulary  speech  recognition  systems. 

Recent  research  [Zue  and  Shipman,  JASA  Vol.  71,  S7;  Huttenlocher  and  Zue,  this  meeting]  has 
shown  that  a  broad  phonetic  representation  of  speech  provides  strong  constraints  for  lexical  access 
of  isolated  words.  In  addition,  Church  [1983  PhD  Thesis,  MIT]  has  demonstrated  the  utility  of 
allophonic  constraints  for  parsing  a  sentence  from  a  phonetician's  transcription.  This  gives  reason  to 
believe  that  the  coupling  of  lexical  and  allophonic  constraints  can  be  a  powerful  tool  in  continuous 
speech  recognition.  The  present  study  explores  how  broad  phonetic  constraints  can  be  applied  to  a 
restricted  continuous  speech  task.  Using  a  broad  phonetic  representation  derived  from  an  ideal 
transcription,  it  was  found  that  on  the  average,  70%  of  the  word  boundaries  in  the  digit  vocabulary 
can  be  identified.  Extending  this  approach  to  speech  data,  we  have  implemented  a  classifier  which 
derives  a  broad  phonetic  representation  from  the  speech  signal.  Preliminary  results  indicate  that 
allophonic  and  lexical  constraints  can  be  effective  in  reducing  the  number  of  string  candidates,  based 
upon  the  output  from  this  classifier.  [Work  supported  by  the  Office  of  Naval  Research  under  contract 
N00014-82-K-0727  and  by  the  System  Development  Foundation.] 

Introduction 


In  a  previous  meeting  of  the  Society,  Zue  and  Shipman  presented  some  results 
demonstrating  the  constraints  imposed  on  the  sound  patterns  of  a  language.  By  indexing 
words  into  a  lexicon  based  on  broad  phonetic  representations,  the  number  of  words  sharing 
a  common  representation  is  very  small.  As  a  result,  they  proposed  an  approach  to  isolated 
word  recognition  for  large  vocabularies.  In  their  proposal,  the  speech  signal  is  first 
classified  into  a  broad  phonetic  string.  The  broad  phonetic  representation  is  then  used  for 
lexical  access,  resulting  in  a  small  set  of  word  candidates.  Finally,  fine  phonetic  distinctions 
are  performed  to  determine  which  of  these  word  candidates  was  actually  spoken.  We’ve 
just  seen  in  the  previous  talk  some  refinements  to  their  original  proposal.  Such  an 
approach  to  isolated  word  recognition  is  in  fact  being  pursued  in  a  number  of  laboratories. 

In  contrast,  this  paper  describes  an  attempt  to  exploit  such  constraints  in  continuous 
speech  recognition.  In  particular,  we  focused  our  inquiries  on  the  digit  vocabulary.  There 
are  two  questions  which  we  tried  to  address:  1)  Are  there  powerful  enough  lexical  and 
allophonic  constraints  to  allow  words  in  a  restricted  task  like  continuous  digit  recognition  to 
be  recovered?  and  2)  Can  such  a  system  be  realistically  implemented  with  good 
performance?  The  first  part  of  this  talk  will  describe  a  set  of  experiments  designed  to 
determine  whether  a  digit  string  can  be  recovered  from  an  ideal  transcription.  Then  in  the 
second  part  of  this  talk,  we  will  describe  our  first  attempt  to  implement  a  speaker- 
independent  continuous  digit  recognition  system  in  which  lexical  candidates  are  reduced 
by  using  a  broad  phonetic  classification  of  the  speech  signal. 


Constraints  on  sound  sequences  were  applied  first  to  ideal  phonetic  and  then  to  ideal 
broad  phonetic  representations  of  digit  strings  to  postulate  word  boundaries.  2000  digit 
strings  of  random  order  and  random  length  containing  approximately  8000  boundaries 
were  used.  We  found  that  from  an  ideal,  detailed  phonetic  transcription,  every  digit 
boundary  can  be  positively  identified.  However,  it  is  a  very  difficult  task  to  automatically 
produce  an  accurate  phonetic  transcription  across  a  wide  population  of  speakers.  On  the 
other  hand,  we  believe  that  producing  a  broad  phonetic  representation  from  the  acoustic 
signal  is  not  as  difficult.  Such  a  representation  is  also  more  robust  against  environmental 
and  interspeaker  variabilities.  Thus,  for  the  second  part  of  this  experiment,  the  constraints 
were  relaxed  by  mapping  the  phones  into  broad  phonetic  categories  in  place  of  detailed 
phonetic  transcriptions.  Six  broad  classes  were  used:  liquid  or  glide,  stop,  vowel,  nasal, 
strong  fricative,  and  weak  fricative.  Coarticulation  effects,  such  as  gemination  of  /s/  in 
"6-7",  were  ignored  in  producing  the  transcriptions. 

An  example  of  the  procedure  is  shown  in  the  figure.  The  digit  string  "64583",  as  shown 
on  the  top  line,  is  mapped  into  the  broad  phonetic  representation  shown  on  the  second  line. 
No  boundary  marks  are  given  in  this  broad  representation,  although  their  placement  is 
indicated  by  the  sharp  sign  shown  on  the  line  above.  For  this  example,  three  of  the  word 
boundaries  can  be  identified  because  no  word  in  the  lexicon  contains  the  sequence  formed 
by  the  broad  labels  on  each  side  of  the  boundary.  However,  the  boundary  between  "5"  and 
"8"  is  not  identified  because  the  lexical  representations  of  both  the  digits  "4"  and  "5" 
contain  the  sequence  "weak-fricative  vowel".  Performing  this  experiment  on  the  2000  digit 
strings,  70%  of  the  word  boundaries  were  found. 
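
The boundary test in this example can be sketched as follows. The broad transcriptions of the ten digits are our own illustrative assumptions (one pronunciation per digit, "zero" rather than "oh", no coarticulation), not the lexicon used in the experiment, so the sketch is not expected to reproduce the 70% figure. A boundary counts as identified when the pair of labels straddling it never occurs inside any word.

```python
# Abbreviations: SF strong fricative, WF weak fricative, ST stop,
# V vowel, N nasal, LG liquid or glide. Transcriptions are assumptions.
DIGITS = {
    "0": ("SF", "V", "LG", "V"),       # zero
    "1": ("LG", "V", "N"),             # one
    "2": ("ST", "V"),                  # two
    "3": ("WF", "LG", "V"),            # three
    "4": ("WF", "V", "LG"),            # four
    "5": ("WF", "V", "WF"),            # five
    "6": ("SF", "V", "ST", "SF"),      # six
    "7": ("SF", "V", "WF", "V", "N"),  # seven
    "8": ("V", "ST"),                  # eight
    "9": ("N", "V", "N"),              # nine
}

# Every pair of adjacent broad labels that occurs inside some word.
INTERNAL_PAIRS = {
    pair for labels in DIGITS.values() for pair in zip(labels, labels[1:])
}


def boundary_identified(left, right):
    """True if the label pair across the boundary occurs inside no word."""
    return (DIGITS[left][-1], DIGITS[right][0]) not in INTERNAL_PAIRS


def identified_boundaries(digit_string):
    """One flag per boundary between adjacent digits in the string."""
    return [
        boundary_identified(a, b)
        for a, b in zip(digit_string, digit_string[1:])
    ]


print(identified_boundaries("64583"))
```

Under these assumed transcriptions, the only boundary in "64583" that is not identified is the one between "5" and "8", where the straddling pair is weak fricative followed by vowel.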

The  ability  to  definitely  identify  70%  of  the  word  boundaries  is  impressive,  but  not 
necessarily  surprising.  For  example,  Church  described  in  his  doctoral  thesis  how  detailed 
allophonic  constraints  can  be  used  to  successfully  parse  a  sentence  from  a  phonetician’s 
transcription.  However,  these  results  may  not  be  directly  applicable  to  real  data.  With  real 
data,  phonetic  variabilities  and  front-end  errors  dictate  that  boundaries  only  be  proposed. 
Instead  of  positively  identifying  word  boundaries,  a  system  could  propose  words  and  their 
corresponding  boundaries  by  examining  the  "sequence"  of  broad  class  labels.  That  is,  the 
word  boundaries  would  be  proposed  at  the  beginning  and  end  of  each  sequence.  In 
addition,  allophonic  constraints  can  help  to  reduce  the  possibly  large  number  of  word 
candidates.  A  preliminary  system  for  exploring  these  lexical  and  allophonic  constraints  will 
be  described  next. 

Implementation 

We  have  implemented  a  broad  phonetic  classifier  and  a  lexical  access  component  that 
will  be  part  of  a  continuous  digit  recognition  system.  In  this  implementation,  eight 
phonetic  classes,  as  shown  in  the  figure,  were  used.  Note  that  they  are  slightly  different 
than  those  used  in  the  early  part  of  the  study.  These  symbols  will  be  used  throughout  the 
remainder  of  this  talk.  The  envisioned  general  structure  of  our  recognition  system  is  shown 
in  the  next  figure.  From  the  speech  signal,  a  broad  phonetic  classification  is  first 
performed.  The  resulting  broad  phonetic  representation  is  of  the  form  of  a  lattice 
composed  of  broad  phonetic  labels.  From  the  broad  phonetic  representation,  the  lexical 
access  component  produces  a  lattice  of  word  candidates  for  the  system.  This  set  of 
candidates  is  a  reduced  set  from  all  possible  candidates  and  can  be  given  to  a  verification 
component  which  would  use  more  detailed  acoustic  analysis  to  identify  the  true  sentence. 

The  broad  classifier  extracts  acoustic  parameters  from  the  speech  signal  and  characterizes 
the  resulting  parameters  as  acoustic  features.  From  the  set  of  features,  the  system  uses  a  set 
of  production  rules  to  deduce  which  broad  classes  may  be  present  at  each  segment.  There 
are  several  important  attributes  of  the  classifier.  One  is  that  it  begins  classification  by 
labeling  regions  of  the  speech  signal  which  can  be  identified  robustly.  Another  attribute  is 
that  in  regions  where  the  cues  are  not  as  robust,  more  than  one  label  is  allowed.  By 
allowing  ambiguity  in  the  labels  under  uncertainty,  as  the  classifier  does,  it  is  ensured  that 
the  correct  answer  is  not  ruled  out  unless  there  is  no  evidence  for  it. 

Lexical  Access  and  Allophonic  Constraints 

In  lexical  access,  lexical  and  allophonic  constraints  are  used  to  map  broad  class 
transcriptions  of  the  words  in  the  lexicon  to  the  lattice  of  broad  labels,  and  indirectly,  to  the 
speech  signal.  The  next  Figure  illustrates  how  allophonic  constraints  are  utilized  in  the 
lexical  representation  of  the  digits.  The  context  in  which  each  pronunciation  occurs  can 
then  be  used  to  constrain  when  a  word  is  hypothesized.  In  the  Figure,  three  pronunciations 
of  the  word  "8"  are  shown.  A  sample  context  in  which  the  second  and  third  pronunciations 
could  occur  is  shown  on  the  right.  Each  pronunciation  differs  in  the  allophone  of  /t/.  This 
can  be  captured  at  the  broad  level  as  shown  in  the  center  column.  A  released  /t/  is 
transcribed  as  a  released  stop,  and  an  unreleased  /t/  and  a  flap  are  transcribed  as  silence 
and  a  short  voiced  obstruent,  respectively. 

The  set  of  reference  transcriptions  that  are  used  for  lexical  retrieval  is  produced  by  the 
system  from  a  given  set  of  pronunciations  of  the  words  in  the  lexicon.  The  derived 
transcriptions  contain  alternate  broad  phonetic  pronunciations  of  each  digit  and  the  context 
under  which  they  can  occur.  Each  transcription  is  matched  against  the  labeled  segments  of 
the  broad  class  lattice.  Thus  errors  made  at  one  point  in  time  will  not  affect  recognition  of 
the  rest  of  the  sentence.  The  next  Figure  illustrates  application  of  lexical  constraints  on  a 
sample  digit  string  "5-8-6".  Each  box  in  the  figure  represents  the  position  of  a  segment 
relative  to  the  other  segments  in  time,  but  does  not  convey  any  information  about  duration 
or  rank.  Part  (a)  depicts  the  broad  segmentation  produced  by  the  classifier  for  the  three 
digits.  In  (b),  the  lattice  of  matching  words  from  the  lexicon  is  shown.  All  the  digits  in 
starred  boxes  can  be  removed  because  each  is  a  word  candidate  that  will  result  in  an 
incomplete  path.  The  resulting  lattice  is  shown  in  (c).  Note  that  the  starred  "5"  could  also 
be  removed  by  allophonic  constraints.  Allophonic  constraints  require  that  if  a  vowel 
follows  the  "5",  the  /v/  represented  by  the  short  voiced  obstruent  should  not  be  deleted. 
The  pronunciation  of  the  starred  "5"  has  /v/  deleted  and  is  thus  incorrect  in  this  context. 
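
Removing word candidates that lead to an incomplete path amounts to a reachability check on the word lattice. The sketch below is only an illustration: the candidate spans are invented rather than read off the figure, `prune_lattice` is a hypothetical name, and each candidate is assumed to cover at least one segment (end greater than start).

```python
def prune_lattice(candidates, n_segments):
    """Keep candidates lying on some complete path from 0 to n_segments.

    Each candidate is (word, start, end), spanning segments start..end-1.
    """
    # Segment positions reachable from the start of the utterance.
    from_start = {0}
    for _, start, end in sorted(candidates, key=lambda c: c[1]):
        if start in from_start:
            from_start.add(end)

    # Segment positions from which the end of the utterance is reachable.
    to_end = {n_segments}
    for _, start, end in sorted(candidates, key=lambda c: c[2], reverse=True):
        if end in to_end:
            to_end.add(start)

    return [c for c in candidates if c[1] in from_start and c[2] in to_end]


# Toy lattice over 7 segments for the string "5-8-6", plus two stray
# candidates that cannot be extended to a complete path.
LATTICE = [
    ("5", 0, 3),
    ("5", 0, 2),  # shorter "5" (e.g. /v/ deleted); nothing starts at 2
    ("8", 3, 5),
    ("2", 4, 6),  # nothing ends at 4 and nothing starts at 6
    ("6", 5, 7),
]

print(prune_lattice(LATTICE, n_segments=7))
```

Allophonic constraints would act as an additional filter on individual candidates; they are not modeled in this sketch.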

The  next  figure  illustrates  actual  constraint  application  in  lexical  access  for  a  longer  digit 
string.  A  spectrogram  of  the  digit  string  "7620085",  the  corresponding  broad 
representation,  and  the  corresponding  word  lattice  are  shown.  It  can  be  observed  that  the 
word  lattice  is  much  reduced  from  the  general  case  where  all  the  words  in  the  lexicon  can 
begin  at  each  segment.  But  it  should  also  be  noted  that  "3",  "4",  and  "5"  are  not 
distinguished  in  the  broad  representation.  However,  a  simple  check,  such  as  vowel  height, 
may  be  able  to  differentiate  among  some  of  them. 

Results 

As  mentioned  earlier,  we  have  implemented  on  a  Lisp  machine  workstation  such  a 
speaker-independent  continuous  digit  recognition  system  up  to  the  level  of  lexical 
candidate  reduction.  The  system  was  developed  using  the  speech  data  from  one  male 
speaker.  We  have  just  evaluated  the  performance  of  the  system  for  the  first  time  last  week 


using  some  two  hundred  digits  spoken  by  three  new  speakers,  one  male  and  two  female. 
The  preliminary  results  indicate  that  1%  of  the  time  the  correct  digit  is  not  one  of  the  lexical 
candidates.  The  corresponding  depth  of  the  digit  lattice  is  four.  We  would  like  to  stress 
that  the  system  is  under  active  development  and  that  the  performance  results  are  very 
preliminary.  Nevertheless,  we  are  encouraged  by  the  results,  and  feel  that  this  may  be  a 
viable  approach  to  continuous  digit  recognition. 

Summary 

In  summary,  this  paper  proposed  a  new  approach  to  continuous  digit  recognition.  In  this 
approach,  the  speech  signal  is  initially  classified  into  broad  phonetic  categories.  It  was 
shown  that  allophonic  and  lexical  constraints  can  be  powerful  even  at  a  broad  phonetic 
level.  Thus,  lexical  access  based  on  a  broad  phonetic  description  will  hopefully  result  in  a 
small  number  of  candidate  digit  strings.  Whenever  more  than  one  digit  candidate  spans  a 
given  time  interval,  fine  phonetic  distinctions  will  be  performed  to  select  the  best 
candidate. Because fine phonetic classification is not performed at the
outset, the system has the potential to be robust against inter-speaker variability.

QUESTIONS


Are  there  powerful  enough  lexical  and 
allophonic  constraints  to  allow  words  in  a 
restricted  task  like  continuous  digit  recognition 
to  be  recovered? 

Can  such  a  system  be  realistically  implemented 
with  good  performance? 


SEQUENCE  CONSTRAINT  EXAMPLE 

6  #  4  #  5  #  8  #  3

SF V S SF  #  WF V G  #  WF V WF  #  V S  #  WF G V


N   nasal
SF  strong fricative
WF  weak fricative
G   liquid or glide
S   stop
V   vowel


BROAD  CLASSES  USED 
WITH  NATURAL  SPEECH 


G  sonorants 

N  intervocalic  nasal 

S  stop 

SF  strong  fricative 

SVO  short  voiced  obstruent 

V  vowel 

WF  weak  fricative 

silence 




BROAD  PHONETIC  CLASSIFIER 


Produces  broad  class  representation  of  input 
speech  signal 

Principles 

•Label  regions  which  can  be  identified 
robustly 


•Allow  ambiguity  when  uncertain 


ALLOPHONIC  CONSTRAINTS 


Lexical  items  can  be  represented  in  terms  of 
multiple  pronunciations 


Example: 

eyt  vowel  stop 

eyt°  vowel  silence  /_  nasal 

eyr  vowel  svo  /_  vowel 


SUMMARY 


Allophonic  and  lexical  constraints  can  be 
powerful  even  at  a  broad  phonetic  level  for  digit 
recognition 

Lexical  access  based  on  broad  phonetic 
information  can  help  to  reduce  the  number  of 
digit  candidates 

Such  a  system  can  potentially  be  speaker- 
independent 

The Use of Phonotactic Constraints to Determine Word Boundaries. Lori F. Lamel
(Room 36-545, Department of Electrical Engineering and Computer Science,
Massachusetts  Institute  of  Technology,  Cambridge,  MA  02139) 

Phonotactic  constraints  limit  the  permissible  word  internal  consonant  sequences  in 
English.  In  some  cases,  knowledge  of  the  phoneme  sequence  uniquely  specifies  the 
location  of  the  word  boundary,  while  in  other  cases,  phonological  rules  based  on  allowable 
consonant  sequences  are  not  sufficient.  For  example,  the  word  boundary  can  be  uniquely 
placed in the sequence /... m g l ... /, as in the word pair "some glass", whereas the word
boundary  location  is  ambiguous  in  the  phoneme  sequence  /...  s  t  r  ...  /  without  further 
acoustic  information.  The  /...  s  t  r  ...  /  may  have  a  word  boundary  in  one  of  three  places  as 
in  "last  rain",  "race  trials",  and  "may  stretch".  Studies  were  conducted  to  determine  the 
utility  of  phonotactic  constraints  to  predict  word  boundaries.  The  databases  included  the 
Merriam-Webster  Pocket  Dictionary,  a  phonemically  balanced  set  of  sentences,  and 
samples of unrestricted text. Results indicate that: 1) word internal consonant sequences
represent  a  very  small  subset  of  all  permissible  consonant  sequences  across  word 
boundaries;  and  2)  acoustic-phonetic  knowledge  is  needed  when  the  word  boundary  is 
ambiguous.  Results  on  the  differences  between  word  and  syllable  boundaries  will  also  be 
presented. [Work supported by the Office of Naval Research under contract N00014-82-K-0727
and the System Development Foundation.]

This study investigates the occurrence of consonant sequences at word boundaries and
within words in English. The main question raised is "Given a consonant sequence, can a
word boundary location be determined?". Earlier work [Zue and Shipman, J. Acoust. Soc.
Am. Suppl. 1 71, S7 (1982)] has shown that there are lexical constraints limiting the number
of  possible  consonant  sequences  within  words.  The  first  problem  that  we  address  is  what,  if 
any,  structural  constraints  on  consonant  sequences  in  English  limit  the  potential  word 
boundary  sequences. 

The  second  problem  we  address  is  whether  there  are  acoustic  cues  to  word  boundaries  in 
continuous  speech,  and  how  these  could  be  used  for  speech  recognition.  Listeners  are  able 
to  hear  individual  words  even  though  words  in  continuous  speech  are  not  separated  by 
pauses.  Nakatani  and  Dukes  (1979)  have  shown  that  word  boundaries,  as  well  as  syllable 
boundaries, are often marked by acoustic cues that listeners use for speech perception. We
believe  that  a  good  understanding  of  the  acoustic  manifestations  of  phonemes  at  word 
boundaries  will  enable  us  to  resolve  phonetic  ambiguity  and  to  propose  potential  word 
boundaries  from  acoustic  evidence. 

To  address  the  first  problem,  we  determined  the  occurrences  of  consonant  sequences  in  a 
corpus  of  text  files.  The  text  files  ranged  in  size  from  200  words  to  38,000  words  and  were 
obtained  from  a  variety  of  sources.  The  phonemic  transcription  of  each  word  was  obtained 
by  dictionary  lookup. 


The number of distinct consonant sequences was determined for within words and
across word boundaries. The number of occurrences of each distinct sequence was also
recorded. Word boundary sequences were formed by concatenating the word-final
consonant sequence of the current word with the word-initial consonant sequence of the
next word. In this example, the word pair "western front" has a word boundary sequence
/nfr/, whereas the word "western" has the medial sequence /st/. There are approximately
70 distinct word-initial consonant sequences, and 130 word-final sequences. Based on these
estimates there are potentially over 9000 word boundary consonant sequences. The top
curve in the figure shows the upper bound on the number of possible word boundary
sequences as a function of the number of words in the corpus. The upper bound arises by
assuming that any word can follow any other word, and is thus the product of the number
of word-final and word-initial consonant sequences for the given corpus.
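
The counting procedure just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the program used in the study: words are assumed to be lists of phoneme symbols from a dictionary lookup, the vowel set and the ASCII phoneme symbols are hypothetical, and the empty sequence is counted as a possible word-initial or word-final sequence when computing the upper bound.

```python
from itertools import groupby


def consonant_runs(phonemes, vowels):
    """Return (initial, medials, final) consonant sequences of one word."""
    runs = [(is_vowel, tuple(group))
            for is_vowel, group in groupby(phonemes, key=lambda p: p in vowels)]
    initial = runs[0][1] if runs and not runs[0][0] else ()
    final = runs[-1][1] if runs and not runs[-1][0] else ()
    medials = [seq for is_vowel, seq in runs[1:-1] if not is_vowel]
    return initial, medials, final


def boundary_stats(corpus, vowels):
    """Collect distinct word-boundary and word-medial consonant sequences.

    corpus: list of words in running-text order, each a list of phonemes.
    Returns (boundary, medial, upper_bound), where upper_bound is the
    product of the numbers of distinct word-final and word-initial sequences.
    """
    parsed = [consonant_runs(word, vowels) for word in corpus]
    initials = {ini for ini, _, _ in parsed}
    finals = {fin for _, _, fin in parsed}
    medial = {seq for _, meds, _ in parsed for seq in meds}

    boundary = set()
    for (_, _, fin), (ini, _, _) in zip(parsed, parsed[1:]):
        seq = fin + ini  # word-final cluster followed by word-initial cluster
        if seq:
            boundary.add(seq)

    return boundary, medial, len(finals) * len(initials)


# Hypothetical transcriptions of "western front".
VOWELS = {"eh", "er", "ah"}
corpus = [["w", "eh", "s", "t", "er", "n"], ["f", "r", "ah", "n", "t"]]
boundary, medial, upper_bound = boundary_stats(corpus, VOWELS)
```

For the two-word corpus above, the only boundary sequence is /nfr/ and the only medial sequence is /st/, matching the example in the text.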

The  lower  curve  shows  the  number  of  word  boundary  sequences  occurring  in  the  corpus. 
In  general,  this  curve  is  about  25%  of  the  upper  bound.  The  behavior  of  the  two  curves 
shown here is similar. We can expect that perhaps the lower curve will approach the
upper  bounding  curve  as  larger  samples  of  text  are  processed.  However,  if  there  are 
structural  constraints  limiting  the  combinations  of  words  in  English,  then  perhaps  the 
consonant  sequences  that  occur  across  word  boundaries  will  remain  a  subset  of  the  total 
number  of  possibilities. 

The  overlap  between  the  word-medial  sequences  and  the  word-boundary  sequences 
gives  a  comparison  of  what  goes  on  within  words  and  across  word  boundaries.  If  the  lexical 
constraints  on  word  internal  sequences  and  the  structural  constraints  across  word 
boundaries  are  the  same,  then  the  curves  for  the  two  conditions  should  be  similar.  The 
overlay  shows  the  number  of  word-medial  consonant  sequences.  On  average  only  20%  of 
the  distinct  word  boundary  sequences  occur  in  word  medial  position.  This  means  if  I  were 
to randomly choose a sequence from the distinct consonant sequences, then, on
average, 4 out of 5 times the sequence would specify a word boundary which would not be
a syllable boundary. If I instead put all the word boundary sequences, occurring as many
times as they do in the corpus, into a bucket, then 1 out of 3 randomly chosen sequences
will only occur at a word boundary. [When the frequencies of occurrence are accounted
for, 1 out of 3 consonant sequences specifies a word boundary.]

There  is  also  a  difference  in  the  average  lengths  of  consonant  sequences  in  word-medial 
and  word-boundary  position.  Word  medial  sequences  are  shorter  than  word  boundary 
sequences.  The  average  length  for  sequences  within  words  is  2.2,  while  across  word 
boundaries  it  is  19.  As  will  be  discussed  next,  longer  sequences  have  more  constraints 
imposed  upon  them. 

How  many  of  the  distinct  consonant  sequences  have  a  unique  boundary  location?  A 
unique  boundary  location  means  that  the  consonant  sequence  can  only  be  divided  in  one 
way  such  that  the  sequence  forms  an  allowable  offset  cluster  followed  by  an  allowable  onset 
cluster. For example, the sequence /mgl/ as in "some glass" can only have a word
boundary between the /m/ and the /g/. The phoneme sequence /sts/ can form an
allowable offset as in "casts" or occur across a word boundary as in "last side". About 65%
of  the  word  boundary  sequences  have  a  unique  boundary  location.  Another  30%  have  2 
possible  locations.  The  word  boundary  sequences  not  occurring  in  medial  position  have  a 
lower  rate  of  ambiguity,  with  approximately  80%  having  the  boundary  location  uniquely 
specified.  Part  of  these  differences  can  be  accounted  for  by  considering  the  length  of 
sequences  in  the  two  positions.  As  mentioned  before,  on  average,  word  boundary 
sequences are almost 1 consonant longer than medial sequences. The average number of
boundary locations decreases with increasing length of the consonant sequence. 60% of the
consonant sequences of length 2 have ambiguous boundary locations, while less than 20% of
longer sequences have ambiguous boundary locations.
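
The definition of a unique boundary location can be made concrete with a short sketch. The function name and the tiny inventories of allowable word-final (offset) and word-initial (onset) clusters below are assumptions chosen only to reproduce the /mgl/ and /sts/ examples; the study itself used far larger inventories.

```python
def boundary_locations(seq, offsets, onsets):
    """Return every split point k such that seq[:k] is an allowable
    word-final cluster and seq[k:] is an allowable word-initial cluster.

    seq is a tuple of phonemes; offsets and onsets are sets of tuples and
    should contain the empty tuple () if a word may end or begin with a vowel.
    """
    return [k for k in range(len(seq) + 1)
            if seq[:k] in offsets and seq[k:] in onsets]


# Toy inventories (hypothetical, far smaller than those of English).
OFFSETS = {(), ("m",), ("s",), ("s", "t"), ("s", "t", "s")}
ONSETS = {(), ("g", "l"), ("l",), ("s",), ("t",), ("s", "t")}

mgl = boundary_locations(("m", "g", "l"), OFFSETS, ONSETS)  # m # gl only
sts = boundary_locations(("s", "t", "s"), OFFSETS, ONSETS)  # st # s and sts #
```

A sequence has a unique boundary location when the returned list has exactly one entry, and is ambiguous when it has two or more.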


Given  that  an  ideal  phonemic  transcription  cannot  uniquely  specify  a  boundary  location 
in  one  third  of  the  consonant  sequences  occurring  at  word  boundaries,  can  these 
boundaries  be  disambiguated?  Our  belief  is  that  in  some  cases  acoustic  evidence  may  be 
useful.  To  this  end  we  conducted  a  preliminary  experiment  to  investigate  acoustic  cues  to 
word  boundaries  in  labial-stop  sonorant  clusters.  Minimal  pair  phrases  such  as  "grape 
lane" and "grey plane", as shown in the figure, were embedded in a carrier phrase and
recorded  by  3  male  speakers.  The  recordings  were  digitized  and  analyzed  using  SPIRE  on 
the  Lisp  Machine.  Parameters  were  extracted  from  the  transcribed  utterances  and 
statistical  analysis  performed. 

The  spectrograms  of  "grape  lane"  and  "grey  plane"  have  several  differences.  These 
include  the  duration  of  the  first  vowel  in  each  pair,  the  amount  of  aspiration  in  the  /p/, 
VOT, and some characteristics of the /l/, such as a steady state region and formant
frequencies at voice onset.


Some simple measures that can be used to differentiate this pair are the duration of the
release portion of the stop, the duration of the sonorant and the formant frequencies at
voice onset. Shown in black is a histogram of the stop release duration for /p/ in
word-initial /pl/ clusters. The devoiced portion of the /l/ is included as part of the release as was
shown in the previous figure. A histogram of the duration of the release of /p/ in /p#l/ is
shown in red. All unreleased /p/'s were given a duration of 0 ms. As can be seen, about
half of the time the /p/ was unreleased, and in the remainder there is a short, generally
weak, release.

This  figure  shows  a  scatter  plot  of  the  second  and  third  formant  at  voice  onset  for  the  /r/ 
in initial /br/ clusters and for /b#r/. Although there is still a fair amount of overlap
between the two conditions, the combination of F2 and F3 gives better discrimination than
either  alone. 

Throughout  this  talk  we  have  intentionally  avoided  the  topic  of  syllable  boundaries,  and 
the  comparison  of  syllable  and  word  boundaries.  There  are  several  reasons  for  this.  First, 
we  do  not  know  what  the  correct  syllabification  of  words  should  be.  We  have  looked  at  two 
different  syllabifications  of  a  large  lexicon,  but  neither  shows  the  consistency  we  would  like 
to  have.  In  the  literature  there  are  various  approaches  to  syllabification  and  we  are  looking 
into  this  issue  further.  It  seems  that  we  have  a  bit  of  the  "chicken  and  the  egg"  problem 
here.  We  would  like  to  have  some  theory  for  syllabification  which  could  be  used  to  predict 
acoustic  manifestations  of  consonant  sequences  at  syllable  boundaries.  However,  we 
believe  that  acoustic  evidence  is  necessary  to  get  reliable  syllabification. 

In  summary,  we  have  found  that  there  are  structural  constraints  on  the  allowable 
sequences  of  consonants  at  word  boundaries  in  English.  4  out  of  5  word  boundary 
sequences  do  not  occur  within  a  word.  Of  these  sequences  80%  have  a  word  boundary 
which  is  uniquely  specified  by  the  sequence  of  phonemes.  In  the  cases  where  the  phoneme 
sequence  cannot  uniquely  specify  the  word  boundary  acoustic  information  may  be  of  help. 

CONSONANT SEQUENCES AT WORD BOUNDARIES

-  Are  structural  constraints  useful  to  delimit  word 
boundaries? 

# 

-  Are  there  acoustic  cues  to  word  boundaries? 




STUDY  OF  STRUCTURAL  CONSTRAINTS  BASED 
ON  IDEAL  TRANSCRIPTION 


CORPUS 

-  Text  files  ranging  in  size  from  200  words  to 
38,000  words 

PROCEDURE 

-  Identify  distinct  consonant  sequences  within 
words  and  across  word  boundaries 

-  Record  the  number  of  occurrences  for  each 
distinct  consonant  sequence 

WHAT IS A UNIQUE BOUNDARY LOCATION?


CONSONANT SEQUENCE  =>  Permissible coda followed by permissible onset


EXAMPLES:

/mgl/  ->  m # gl    "some glass"

/sts/  ->  sts #     "casts"
           st # s    "last side"


HOW MANY CONSONANT SEQUENCES HAVE A
UNIQUE BOUNDARY LOCATION?


-  65  %  of  word  boundary  sequences  have  unique 
locations 

-  80  %  of  word  boundary  sequences  not  occurring 
in  word  medial  position  have  unique  boundary 
locations 

-  Most ambiguous sequences have only two
possible  boundary  locations 




SUMMARY 


-  4  out  of  5  consonant  sequences  occur  only  at 
word  boundaries 

-  80%  of  word  boundary  sequences  have  only  one 
possible  boundary  location 

-  Acoustic  information  may  help  to  locate  word 
boundaries 


presented  at  the  International  Conference  on 

COMPUTER RECOGNITION OF ISOLATED WORDS
FROM  LARGE  VOCABULARIES: 

LEXICAL  ACCESS  USING  PARTIAL  PHONETIC  INFORMATION 

Current approaches to isolated word recognition rely on classical pattern recognition techniques which utilize little or
no speech-specific knowledge. While the performance of these systems is quite good for a set of restricted
vocabularies and tasks, they are not readily extendible to more complex tasks. This paper proposes a new approach
to isolated word recognition intended for multiple speakers and large vocabularies. The system performs a broad
phonetic categorization of the acoustic signal. This partial phonetic information is then used for lexical access,
generating a small set of possible word candidates. Fine phonetic distinctions can then be performed in order to
determine which of these word candidates was actually spoken. Preliminary results on lexical access and word
candidate reduction are presented.


1.  INTRODUCTION 

During the past decade, significant advances have been made in
the field of isolated word recognition (IWR). Today, speech
recognition systems can typically recognize a small set of words,
say 20, for a given speaker with an error rate of less than 5%.
Current approaches generally use a pattern matching technique
where the input signal is matched against a set of stored templates.
The success of these systems can at least in part be attributed to
the introduction of novel parametric representations [1], distance
metrics [3], and the very powerful time alignment procedure called
dynamic programming.

While we have clearly made significant advances in dealing with
the IWR problem, there is serious doubt regarding the extendibility
of the pattern matching approach to large vocabulary,
speaker-independent isolated word recognition. One of the limitations of
the template matching approach is that both storage and
computation grow linearly with the size of the vocabulary. That is,
a separate template is stored for each word, and recognition [...]
For very large vocabularies (several thousand words or more),
these computation and storage requirements become prohibitively
expensive. The performance of such IWR systems deteriorates for large
vocabularies. Furthermore, as the size of the vocabulary grows, it
becomes imperative that recognition systems be
speaker-independent, since training the system for each speaker will be
painfully impractical.

In this paper, we propose a new approach to large-vocabulary,
isolated word recognition. This approach relies on partial phonetic
information for lexical access, thereby greatly reducing the number
of word candidates. The partial phonetic representation derives
its power from the high degree of redundancy in language.
By exploiting this redundancy, it is possible to eliminate all but a
handful of words out of a 20,000-word lexicon, using only half a
dozen broad phonetic classes. In the next section of the paper, we
summarize a set of studies demonstrating the extent to which a
partial phonetic representation can be used to eliminate all but a
small set of word candidates. The subsequent section describes
the system being implemented, and presents some preliminary
results.


•kanrmsayoe 


i  by  «■  One*  or  uww  i 


r  Contract  N00014- 


2.  DESIGN  PHILOSOPHY 
2.1  Spectrogram  Reading 

Reliance on general pattern matching techniques has been partly
motivated by the unsatisfactory performance of early
phonetically-based speech recognition systems. In fact, the difficulty of the
acoustic-phonetic recognition task has led to speculation that
phonetic information must be derived primarily from syntactic,
semantic and discourse constraints rather than from the acoustic
signal. However, the poor performance of early phonetically-based
recognition systems can be attributed mainly to our limited
knowledge of the acoustic characteristics of speech sounds,
particularly the effects of local context. This picture is slowly
changing. We now have a far better understanding of contextual
influences on phonetic segments, as evidenced by a series of
spectrogram reading experiments. It was found that a trained
subject can phonetically transcribe unknown utterances from
speech spectrograms with an accuracy of approximately 88%.
This level of performance is far better than the phonetic
recognizers reported in the literature, both in terms of accuracy
and rank order statistics. It was also demonstrated that
spectrogram reading makes use of explicit acoustic
phonetic rules, and that this skill can be learned by others. These
results suggest that the acoustic signal is rich in phonetic
information, and that it may be possible to obtain substantially
better performance in automatic phonetic recognition.

Even with substantially improved acoustic phonetic knowledge,
however, an approach based entirely on detailed phonetic analysis
still has serious drawbacks. Using solely acoustic information, it is
often difficult to make fine phonetic distinctions. (For example, it is
difficult to distinguish the word pair "Sue/shoe" reliably across a
wide range of speakers). Furthermore, the application of
context-dependent rules requires the specification of the correct context.
(For example, the identification of a retroflexed /t/ in the word
"tree" depends upon correctly identifying the retroflex consonant
/r/). Thus, recognition based solely on detailed phonetic
transcription of an unknown utterance may not be desirable, or
even possible.


2.2  Constraints  on  Sound  Patterns 

The sound patterns of a given language are not only limited by the
inventory of basic sound units, but also by the allowable
combinations of these sound units. Knowledge about such
phonotactic constraints is presumably used in speech
communication, since it provides native speakers with the ability to
fill in phonetic details that are otherwise not available or are
distorted. Thus, as an extreme example, a word such as "splint"
can be recognized without having to specify the detailed acoustic
characteristics of any phoneme other than the /l/, because it is the
only word in the Merriam Webster's Pocket Dictionary (containing
about 20,000 words) that satisfies the following description:

[CONSONANT][CONSONANT][l][VOWEL][NASAL][STOP]

While the existence of phonotactic constraints is well known, a
recent set of studies [7] provides a glimpse of the magnitude of their
predictive power. These studies examine the phonotactic
constraints of American English from the phonemic distributions in
the 20,000-word Merriam Webster's Pocket Dictionary. In one
study the phonemes of each word were mapped into one of six
broad phonetic categories: vowels, stops, nasals, liquids and
glides, strong fricatives, and weak fricatives. Thus, for example,
the word "speak", with a phonemic string given by /spik/, is
represented as the pattern:

[STRONG FRICATIVE][STOP][VOWEL][STOP]

It was found that, even at this broad phonetic level, approximately
1/3 of the words in the 20,000-word lexicon can be uniquely
specified. One can view the broad phonetic classifications as
partitioning the lexicon into equivalence classes of words sharing
the same phonetic class pattern (e.g., the words "speak" and
"steep" are in the same equivalence class). The average size of
these equivalence classes for the 20,000-word lexicon was found
to be approximately 2, and the maximum size was approximately
200. In other words, in the worst case, a broad phonetic
representation of the words in a large lexicon reduces the number
of possible word candidates to about 1% of the lexicon.
Furthermore, over half of the lexical items belong to equivalence
classes of size 9 or less.
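
The partitioning of a lexicon into equivalence classes can be illustrated with a small sketch. The phoneme symbols, the phoneme-to-class table, and the four-word lexicon are hypothetical stand-ins for the 20,000-word dictionary used in the studies, so the statistics it produces are purely illustrative.

```python
from collections import defaultdict

# Hypothetical mapping from phoneme symbols to the six broad classes.
BROAD_CLASS = {
    "p": "STOP", "t": "STOP", "k": "STOP",
    "s": "STRONG FRICATIVE", "f": "WEAK FRICATIVE",
    "n": "NASAL", "l": "LIQUID/GLIDE",
    "i": "VOWEL", "I": "VOWEL",
}


def broad_pattern(phonemes):
    """Map a phoneme string to its sequence of broad phonetic classes."""
    return tuple(BROAD_CLASS[p] for p in phonemes)


def equivalence_classes(lexicon):
    """Group words that share the same broad-class pattern."""
    classes = defaultdict(list)
    for word, phonemes in lexicon.items():
        classes[broad_pattern(phonemes)].append(word)
    return dict(classes)


# Toy lexicon with assumed phonemic transcriptions.
lexicon = {
    "speak": ["s", "p", "i", "k"],
    "steep": ["s", "t", "i", "p"],
    "splint": ["s", "p", "l", "I", "n", "t"],
    "feet": ["f", "i", "t"],
}
classes = equivalence_classes(lexicon)
sizes = [len(words) for words in classes.values()]
uniquely_specified = sum(1 for size in sizes if size == 1)
largest_class = max(sizes)
```

Here "speak" and "steep" fall into one class, while "splint" and "feet" are each uniquely specified by their patterns. The same bookkeeping applied to a full dictionary yields the class-size statistics quoted above.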


2.3 Dealing with Phonetic Variability

The results of the Shipman and Zue study demonstrate that broad
phonetic classifications of words can, in principle, reduce the
number of word candidates significantly. However, the acoustic
realization of a phone can be highly variable, and this variability
introduces a good deal of ambiguity in the initial classification of
the speech signal. At one extreme, the acoustic characteristics of
phonemes can undergo simple modifications as a
consequence of contextual and interspeaker differences, such as
the differences in the acoustic signal for the various allophones of
/t/ in the words "tea", "tree", and "beauty". At the other extreme,
contextual effects can also produce severe modifications in which
phonemes or syllables are deleted altogether. Thus, for example,
the word "international" can have many pronunciations including
the deletion of the phoneme /t/ and the deletion of the penultimate
schwa.

It is important to note that these phenomena can occur as a
consequence of the high level of variability in natural speech, as
well as resulting from an error by the front-end classifier of a
speech recognition system. Given such uncertainties, one may
ask whether the original results of Shipman and Zue still hold for
lexical access. The answer is partially provided in a recent study
conducted by Huttenlocher and Zue [7], in which they observed the
effects on lexical constraints after introducing various amounts of
phonetic uncertainty into the lexicon. They found that, even
allowing as much as 20% phonetic uncertainty, lexical constraints
imposed by sequences of broad phonetic classes are still
extremely powerful. Over 30% of the lexical items can still be
uniquely specified, and over 90% of the time the size of the
equivalence class is 5 or less. On the other hand, the maximum
size of the equivalence classes grows steadily as the amount of
labeling uncertainty increases.

3.  SYSTEM  DESCRIPTION 
3.1  Overview 

Based on the results of the studies cited above, we propose a new
phonetically-based approach to isolated-word recognition. This
approach is distinctly different from previous attempts in that
detailed phonetic analysis of the acoustic signal is not performed.
Rather, the speech signal is classified into several broad manner
categories which are then used directly for lexical access. The
broad phonetic (manner) classifier serves several purposes. First,
errors in phonetic labeling, which are most often caused by
detailed phonetic analyses, should be reduced. Second, by
avoiding fine phonetic distinctions, the system should also be less
sensitive to interspeaker variations. Finally, experimental results
indicate that even at the broad phonetic level, sequential
constraints and the lexical distribution can limit the search space
substantially. This last feature is particularly important when the
size of the vocabulary is large (on the order of several thousand
words or more).

Once the acoustic signal is reduced to a sequence (or lattice) of
broad phonetic segments, the resulting representation is used for
lexical access. The intent is to reduce the possible word
candidates to a very small set by utilizing knowledge about the
structural constraints, both segmental and suprasegmental, of the
words. Lexical access is performed by indexing into an "inverted
lexicon", where each word is stored in terms of its broad phonetic
classification. Since the broad phonetic classifications specify
such small subsets of the lexicon, the resulting set of word
candidates should be very small. From this word set the correct
word can then be selected through the judicious application of
detailed phonetic knowledge.

We should point out that the system uses a very conservative
control strategy. Assertions about the acoustic or phonetic identity
of a speech segment are only made when the evidence for that
classification is very strong. In this manner, a reliable gross
acoustic description is obtained. Detailed classification is left until
after lexical access, when specific phonetic hypotheses can be
formed and evaluated. This approach can also be viewed as a late
binding approach, in which binding a segment to a label (or set of
labels) is delayed as long as possible. A block diagram of the
processing performed by the phonetic class recognizer is
presented in Figure 1. The remainder of this section will follow the
outline of the block diagram.
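
Lookup in an inverted lexicon, with the labeling decision left open where the classifier is uncertain, can be sketched as follows. This is an illustration under assumptions rather than the implemented system: the lexicon contents, the class abbreviations, and the idea of representing an uncertain segment as a set of alternative labels are all hypothetical, and segmentation uncertainty (a true lattice) is not modeled.

```python
from itertools import product

# Hypothetical inverted lexicon: broad-class pattern -> words with that pattern.
# SF = strong fricative, S = stop, V = vowel, G = liquid or glide, N = nasal.
INVERTED_LEXICON = {
    ("SF", "S", "V", "S"): ["speak", "steep"],
    ("SF", "S", "G", "V", "N", "S"): ["splint"],
    ("S", "V"): ["tea"],
    ("S", "G", "V"): ["tree"],
}


def lexical_access(segment_labels, inverted_lexicon):
    """Return all words whose pattern matches some reading of the input.

    segment_labels: one set of candidate broad-class labels per segment;
    a set with several labels means the classifier left the segment ambiguous.
    """
    candidates = set()
    for pattern in product(*segment_labels):
        candidates.update(inverted_lexicon.get(pattern, ()))
    return candidates


confident = lexical_access([{"SF"}, {"S"}, {"V"}, {"S"}], INVERTED_LEXICON)
uncertain = lexical_access([{"S"}, {"G", "V"}, {"V"}], INVERTED_LEXICON)
```

Keeping several labels on an uncertain segment delays the binding of that segment until after lexical access, at the cost of a somewhat larger candidate set that detailed phonetic analysis must then resolve.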


WAVEFORM
    |
    v
PARAMETERIZATION
    |
    v
CHARACTERIZATION
    |
    v
ACOUSTIC SEGMENTATION
    |
    v
PHONETIC-CLASS SEGMENTATION
    |
    v
LEXICAL ACCESS


Figure 1: Block Diagram of Broad Phonetic Classification


3.2  Signal  Parameterizatlen 

The parameterization stage consists of extracting a set of
parameters from the acoustic waveform. The speech signal is
sampled at 16 kHz and passed through a filter bank. The energy in
each band is then calculated every 5 msec, using a 29 msec
window. In general the short-time variations in these energy
contours are overly detailed, since many of the small changes in
energy are irrelevant to the broad phonetic identification task.
Ideally, we would like to preserve only that acoustic information
which is relevant to broad phonetic events. However, simply
smoothing the contour over a long time window is not appropriate,
because some short-term events are important. Therefore, the
parameterization stage of processing is designed to remove the
irrelevant information in the energy parameters while preserving
[...] producing a stepwise approximation to the energy contour, where
the steps correspond to longer term acoustic events. The
approximation preserves the area of the original energy contour by
making each step the same area as the portion of the curve within
that step. Recent work on scale space filtering is also concerned
with producing qualitative descriptions of signals. However, that
work is concerned with forming such descriptions independent of
the process underlying the signal. Figure 2 contains some energy
contours together with their stepwise approximations.
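
The area-preserving property can be illustrated with a short Python sketch. It assumes the step boundaries are already known, since the method for placing them is not described in this section; the function name `stepwise_approximation` and the sample values are hypothetical.

```python
FRAME_MS = 5  # energy is computed every 5 msec, as stated in the text


def stepwise_approximation(energy, boundaries):
    """Replace each stretch of the contour by a flat step of equal area.

    `energy` is a list of per-frame energies for one band. `boundaries` is a
    list of frame indices splitting the contour into steps, e.g. [0, 3, 6, 8].
    How the boundaries are chosen is NOT specified here; they are an input.
    """
    steps = []
    for start, end in zip(boundaries, boundaries[1:]):
        segment = energy[start:end]
        # The mean height gives the step the same area as the curve it replaces.
        height = sum(segment) / len(segment)
        steps.append((start * FRAME_MS, end * FRAME_MS, height))
    return steps


# Made-up contour: a rise, a plateau, and a fall.
contour = [1.0, 2.0, 3.0, 6.0, 6.0, 6.0, 2.0, 0.0]
for start_ms, end_ms, height in stepwise_approximation(contour, [0, 3, 6, 8]):
    print(start_ms, end_ms, height)
```

Using the mean of each stretch is the simplest way to satisfy the equal-area condition; a different step shape would require a different height rule.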

3.3 Characterizing the Parameters

The next stage of processing is concerned with producing
symbolic characterizations of the stepwise approximations to the
energy contours. The stepwise approximations preserve two kinds
of information: magnitude and relative magnitude. Similarly, the
symbolic characterization of each step is in terms of both
magnitude and relative magnitude. The magnitude of each step is
either LOW, MEDIUM or HIGH, and the relative magnitude is either
PEAK, VALLEY, ONSET, or OFFSET. These magnitude and
relative magnitude characterizations are formed by taking
advantage of the natural ordering and distribution of the segments,
rather than using arbitrary thresholds. For example, a step is said
to be a PEAK if its magnitude is greater than that of both its
neighbors. Figure 3 presents a stepwise approximation and its
categorizations.


Figure 2: Energy Contours and Their Stepwise
Approximations for the Word "length"


Figure 3: Categorization of an Energy
Contour for the Word "length"
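
A minimal Python sketch of the symbolic characterization follows. Only the PEAK definition is taken from the text. The readings of VALLEY, ONSET, and OFFSET, the treatment of the utterance edges, and the rank-based three-way split for LOW, MEDIUM, and HIGH are assumptions made for illustration, as are the function names.

```python
def relative_magnitude(heights, i):
    """Label step i relative to its neighbors.

    PEAK follows the text: higher than both neighbors. The other three
    labels are ASSUMED readings: VALLEY is lower than both neighbors, ONSET
    is a rising step, OFFSET is a falling step. A missing neighbor at either
    edge is treated as lower than any step (also an assumption).
    """
    low = float("-inf")
    h = heights[i]
    prev = heights[i - 1] if i > 0 else low
    nxt = heights[i + 1] if i + 1 < len(heights) else low
    if h > prev and h > nxt:
        return "PEAK"
    if h < prev and h < nxt:
        return "VALLEY"
    if prev <= h <= nxt:
        return "ONSET"
    return "OFFSET"


def magnitude_labels(heights):
    """Label steps LOW, MEDIUM, or HIGH from their rank order.

    Splitting the ranks into thirds is an assumed stand-in for "natural
    ordering and distribution"; it avoids any fixed energy threshold.
    """
    order = sorted(range(len(heights)), key=lambda i: heights[i])
    labels = [None] * len(heights)
    for rank, i in enumerate(order):
        labels[i] = ("LOW", "MEDIUM", "HIGH")[3 * rank // len(heights)]
    return labels


step_heights = [1.0, 5.0, 2.0, 4.0, 6.0, 3.0]  # made-up step magnitudes
print([relative_magnitude(step_heights, i) for i in range(len(step_heights))])
print(magnitude_labels(step_heights))
```

Because both functions compare steps only with each other, the labels do not change if the whole contour is scaled by a constant gain.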


3.4  Acoustic  Segmentation 

Given the characterizations of the energy in various frequency
bands, we need some way of combining this information in order to
produce an acoustic segmentation of the speech signal. We have
chosen to combine information using predicates on the bandpass
energy characterizations. For example (LOCAL-PEAK
TOTAL-ENERGY) indicates where the TOTAL-ENERGY is a PEAK with
respect to the neighboring steps. Similarly (HIGH-AMPLITUDE
TOTAL-ENERGY) specifies where the TOTAL-ENERGY is HIGH in
magnitude. Acoustic classes are then defined in terms of simple
combinations of these predicates. For example, let us consider a
potential rule for strongly voiced segments (such as strong
vowels):

(DEFINE-ACOUSTIC-CLASS STRONGLY-VOICED
  (DURATION-BETWEEN 30. NIL
    (AND (HIGH-AMPLITUDE LOW-FREQUENCY-ENERGY)
         (LOCAL-PEAK TOTAL-ENERGY)
         (LOCAL-PEAK LOW-FREQUENCY-ENERGY))))

This rule states that a strongly voiced segment must have a lot of
low-frequency energy, and that both total-energy and
low-frequency energy must be local peaks. The rule also specifies that
the acoustic segment be at least 30 milliseconds long (with no limit
on the maximum duration).
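
To make the semantics of the rule concrete, here is a hedged Python rendering. It assumes, for simplicity, that every band has been resampled onto a common 5 msec frame grid and that each frame carries a (magnitude, relative magnitude) pair per parameter; that representation and all function names are hypothetical, not the system's own.

```python
FRAME_MS = 5  # assumed common frame grid, matching the 5 msec analysis rate


def high_amplitude(frame, parameter):
    """True where the named parameter is HIGH in magnitude."""
    return frame[parameter][0] == "HIGH"


def local_peak(frame, parameter):
    """True where the named parameter is a PEAK relative to its neighbors."""
    return frame[parameter][1] == "PEAK"


def strongly_voiced(frame):
    """The AND clause of the STRONGLY-VOICED rule."""
    return (high_amplitude(frame, "LOW-FREQUENCY-ENERGY")
            and local_peak(frame, "TOTAL-ENERGY")
            and local_peak(frame, "LOW-FREQUENCY-ENERGY"))


def duration_between(frames, predicate, min_ms, max_ms=None):
    """Return (start_ms, end_ms) spans where `predicate` holds long enough.

    `max_ms=None` plays the role of NIL in the rule: no upper limit.
    """
    spans, start = [], None
    # The trailing None is a sentinel that closes a run reaching the last frame.
    for i, frame in enumerate(list(frames) + [None]):
        holds = frame is not None and predicate(frame)
        if holds and start is None:
            start = i
        elif not holds and start is not None:
            length_ms = (i - start) * FRAME_MS
            if length_ms >= min_ms and (max_ms is None or length_ms <= max_ms):
                spans.append((start * FRAME_MS, i * FRAME_MS))
            start = None
    return spans


# Made-up frames: a 40 msec voiced stretch and a 20 msec one.
voiced = {"TOTAL-ENERGY": ("HIGH", "PEAK"), "LOW-FREQUENCY-ENERGY": ("HIGH", "PEAK")}
other = {"TOTAL-ENERGY": ("LOW", "VALLEY"), "LOW-FREQUENCY-ENERGY": ("LOW", "VALLEY")}
frames = [other] * 2 + [voiced] * 8 + [other] * 3 + [voiced] * 4
print(duration_between(frames, strongly_voiced, min_ms=30))
```

Only the first stretch satisfies the 30 msec minimum, so the second one produces no label at all rather than a forced one.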


The control structure of the acoustic classifier is basically that of a
simple production system, where zero or more rules may fire at any
point in time. That is, each rule is triggered when the input
matches its conditions. This kind of "free response" control
structure has the advantage of not forcing a segment label at every
point in time, nor requiring the start and end of successive labels to
match exactly. As we noted above, this is a late binding strategy
which puts off labelling a segment until there is good evidence for
that label. Thus, the output of this level is a set of acoustic labels
which can be overlapping or have gaps between them. It may
seem that allowing overlap and unaccounted for gaps will cause


using sequences of phonetic classes. In the end, what we care
about is having a phonetic-class sequence which accurately
preserves the identity and order of the phones present in the
acoustic signal. We do not, in general, care exactly where in time
the classes start and stop. In addition, forcing labels at each point
in time can erroneously introduce false segments on the basis of
poor phonetic information. These "false segment" errors will
produce a phonetic-class sequence which is incorrect.
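
The point that only identity and order matter can be shown with a small Python sketch. It assumes labels are ordered by start time, which is one plausible reading rather than a documented detail; the `Label` type, both function names, and the sample times are hypothetical.

```python
from typing import NamedTuple


class Label(NamedTuple):
    start_ms: int
    end_ms: int
    name: str


def _ordered(labels):
    # ASSUMPTION: labels are ordered by start time, then end time.
    return sorted(labels, key=lambda lab: (lab.start_ms, lab.end_ms))


def class_sequence(labels):
    """Keep the identity and order of the labels, discarding exact times."""
    return [lab.name for lab in _ordered(labels)]


def gaps_and_overlaps(labels):
    """List unlabeled gaps and overlapping stretches between adjacent labels."""
    ordered = _ordered(labels)
    gaps, overlaps = [], []
    for a, b in zip(ordered, ordered[1:]):
        if b.start_ms > a.end_ms:
            gaps.append((a.end_ms, b.start_ms))
        elif b.start_ms < a.end_ms:
            overlaps.append((b.start_ms, min(a.end_ms, b.end_ms)))
    return gaps, overlaps


# Made-up output of independently firing rules, in arbitrary order.
fired = [
    Label(120, 200, "WEAK-TURBULENCE"),
    Label(0, 90, "STRONGLY-VOICED"),
    Label(85, 115, "SILENCE"),
]
print(class_sequence(fired))
print(gaps_and_overlaps(fired))
```

The 5 msec overlap and the 5 msec gap in this example leave the resulting class sequence untouched, which is why neither needs to be resolved before lexical access.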


3.5 Phonetic Classification and Lexical Access


The final stage of the broad phonetic classification uses local
acoustic context to produce a sequence of phonetic class labels.
This is done using a rewrite grammar, which maps acoustic
segments and contexts onto corresponding phonetic segments.
For example, the rule for aspirated stop consonants maps silence
followed by weak turbulence to an aspirated stop. The output of
this stage is a sequence (or lattice in the case of multiple possible
classifications) of broad phonetic classes. The broad phonetic
class sequence is then used to index into the 20,000-word lexicon.
The lexicon is stored in inverted form, with each entry hashed
according to its broad phonetic classification. Therefore, a single
hash lookup produces the set of words matching a given sequence
of broad phonetic classes. In the case of a segment lattice, a small
number of hash lookups must be performed. However, the
occurrence of lattices is limited to the case where two broad
phonetic classes cannot be reliably differentiated, for instance
strong fricatives versus affricates in utterance-initial position.
Figure 4 contains the broad acoustic and phonetic classifications
for the word "length" uttered by a male speaker, together with the
lexical entries which match the phonetic class sequence.
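
The rewrite step and the lattice lookup can be sketched together in Python. Only the first rule (silence followed by weak turbulence becomes an aspirated stop) is taken from the text. The other rules, the label STRONG-TURBULENCE, the toy lexicon, and the function names are assumptions for illustration.

```python
from itertools import product

# (acoustic pattern, phonetic result). A tuple result means the classes
# cannot be told apart, so both alternatives are kept in the lattice.
# Only the first rule is from the text; the rest are hypothetical.
REWRITE_RULES = [
    (("SILENCE", "WEAK-TURBULENCE"), "ASPIRATED-STOP"),
    (("STRONGLY-VOICED",), "STRONG-VOWEL"),
    (("STRONG-TURBULENCE",), ("STRONG-FRIC", "AFFRICATE")),
]


def rewrite(acoustic):
    """Scan left to right and apply the first rule whose pattern matches.

    Returns a lattice: one tuple of alternative classes per position.
    """
    lattice, i = [], 0
    while i < len(acoustic):
        for pattern, result in REWRITE_RULES:
            if tuple(acoustic[i:i + len(pattern)]) == pattern:
                lattice.append(result if isinstance(result, tuple) else (result,))
                i += len(pattern)
                break
        else:
            i += 1  # no rule matches: leave this segment unlabeled
    return lattice


def lookup(lattice, inverted_lexicon):
    """Perform one hash lookup per path through the lattice."""
    words = []
    for path in product(*lattice):
        words.extend(inverted_lexicon.get(path, []))
    return words


# Toy inverted lexicon, made up for the example.
TOY_LEXICON = {
    ("STRONG-FRIC", "STRONG-VOWEL", "ASPIRATED-STOP"): ["sick"],
    ("AFFRICATE", "STRONG-VOWEL", "ASPIRATED-STOP"): ["check"],
}

segments = ["STRONG-TURBULENCE", "STRONGLY-VOICED", "SILENCE", "WEAK-TURBULENCE"]
lattice = rewrite(segments)
print(lattice)
print(lookup(lattice, TOY_LEXICON))
```

Here the ambiguous initial segment doubles the number of lookups from one to two, which is consistent with the remark that a lattice costs only a small number of extra hash lookups.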


4. RESULTS AND SUMMARY


A version of the system proposed in this paper has been
implemented on a Lisp Machine based workstation in our
laboratory. Preliminary results, based on 100 word tokens spoken
by two male talkers, are encouraging. Over 88% of the time, the
correct word was in the set of word candidates. For example, the
word "century" spoken by a male speaker resulted in 4 word

central

century 


word  spoken  by  another  speaker  resulted  in  il 
In both of these examples, the correct word is


Acoustic Classification:


STRONGLY-VOICED VOICE-BAR SILENCE WEAK-TURBULENCE


Phonetic Classification

(optional segments in parens):


(VOICED-STOP) (VOICED) STRONG-VOWEL NASAL WEAK-FRIC


Lexical  Entries  from  20,000-Word  Lexicon: 


8  entries: 


after  anth 
length 


Figure 4: Acoustic and Phonetic Classifications
of the Word "length"


one of the word candidates. We are continually refining the rules
for acoustic segmentation and phonetic classification. Our goal
for the initial lexical access is to reduce the number of word
candidates to less than 10 on the average, with an error rate of less
than 9%.


In summary, this paper presents a new approach to the problem of
recognizing isolated words from large vocabularies and multiple
speakers. The system initially classifies the acoustic signal into
several broad manner categories. Once the set of potential word
candidates has been significantly reduced through the utilization
of the structural constraints, the acoustic differences between the
remaining words can be examined in detail. Such a procedure will
enable us to deal with the large vocabulary recognition problem in
an efficient manner. What is even more important is the fact that
such an approach bypasses the highly error-prone process of
deriving a complete phonetic transcription from the acoustic
[...] can be applied in a top-down verification mode, where the exact
phonetic context can be specified.